Familywise error control in multi-armed response-adaptive trials

Response-adaptive designs allow the randomization probabilities to change during the course of a trial based on cumulated response data, so that a greater proportion of patients can be allocated to the better performing treatments. A major concern over the use of response-adaptive designs in practice, particularly from a regulatory viewpoint, is controlling the type I error rate. In particular, we show that the naive z-test can have an inflated type I error rate even after applying a Bonferroni correction. Simulation studies have often been used to demonstrate error control, but do not provide a guarantee. In this paper, we present adaptive testing procedures for normally distributed outcomes that ensure strong familywise error control, by iteratively applying the conditional invariance principle. Our approach can be used for fully sequential and block randomized trials, and for a large class of adaptive randomization rules found in the literature. We show there is a high price to pay in terms of power to guarantee familywise error control for randomization schemes with extreme allocation probabilities. However, for proposed Bayesian adaptive randomization schemes in the literature, our adaptive tests maintain or increase the power of the trial compared to the z-test. We illustrate our method using a three-armed trial in primary hypercholesterolemia.


1 Introduction
Clinical trials typically randomize patients using a fixed randomization scheme, where the probabilities of assigning patients to the experimental treatments and control are pre-specified and constant. A common method is to simply use equal randomization to the different arms of the trial. However, such randomization schemes can mean that a substantial proportion of the trial participants will continue to be allocated to treatments that are not the best available, even if interim data indicates that one treatment is likely to be superior. Response-adaptive trials address this concern by adaptively changing the randomization probabilities, so that a greater proportion of patients are allocated to the treatment arm which has a better performance based on the cumulated response data. Hence, as the trial continues and accumulates more data, patients in the trial can benefit from having a higher probability of being assigned to a better treatment.
Many classes of response-adaptive randomization schemes have been proposed in the literature for binary outcomes. Randomization schemes based on urn models (such as the randomized play-the-winner rule (Wei and Durham, 1978)), and adaptive biased coin designs (such as the doubly-adaptive biased coin design (Eisele, 1994)), have been extensively studied, with a comprehensive presentation given by Hu and Rosenberger (2006). Many Bayesian adaptive randomization (BAR) schemes have also been proposed (Thall and Wathen, 2007; Trippa et al., 2012; Yin et al., 2012; Wason and Trippa, 2014), where the randomization probabilities are recursively updated using a Bayesian model for the patient outcomes.
There is also a growing interest in response-adaptive randomization for continuous responses. For example, there are schemes based on doubly-adaptive biased coin designs (Hu and Rosenberger, 2006; Zhang and Rosenberger, 2006; Biswas et al., 2007), urn-based drop-the-loser designs (Ivanova et al., 2006) and bandit-based designs (Smith and Villar, 2017). A comprehensive recent overview is given by Biswas and Bhattacharya (2016). In this paper, our focus is on normally distributed outcomes, which are encountered in many clinical trials. Indeed, 23 out of the 59 multi-arm clinical trials identified in a review by Wason et al. (2014) had a continuous outcome.
A comprehensive discussion of the relative advantages and disadvantages of adaptive versus fixed randomization is beyond the scope of this paper. Indeed, the use of adaptive randomization is a widely discussed and somewhat controversial topic in clinical trials. For binary responses, a number of comparisons (Korn and Freidlin, 2011; Berry, 2011; Thall et al., 2015; Wathen and Thall, 2017) have focused on the BAR scheme proposed by Thall and Wathen (2007). Particularly in the two-arm setting, fixed randomization appears to be preferable to this scheme in terms of power and the number of treatment failures, except when the number of patients to be treated beyond the trial is small (as in rare diseases) or when there are large treatment differences (Lee et al., 2012; Du et al., 2015).
However, even in the two-arm setting, optimal response-adaptive schemes (i.e. those that target some formal optimality criteria) have been shown to have benefits over fixed randomization, increasing both power and patient benefit simultaneously (Rosenberger et al., 2001; Rosenberger and Hu, 2004; Tymofyeyev et al., 2007; Bello and Sabo, 2016). In the multi-arm setting, which is the focus of this paper, adaptive randomization can have further advantages over fixed randomization (Berry, 2011; Wason and Trippa, 2014; Hey and Kimmelman, 2015; Berry, 2015), particularly for more complex trial designs.
Response-adaptive designs also have applications outside of the context of clinical trials. For example, multi-arm bandit models are used for market learning in economics (Bergemann and Välimäki, 2006) and to improve modern production systems that emphasize 'continuous improvement' (Scott, 2010). Some of the ethical concerns surrounding adaptive randomization (Hey and Kimmelman, 2015) would not apply in these contexts.
Despite the extensive literature on response-adaptive randomization, relatively few clinical trials have actually used such schemes in practice. One of the first examples, which used a randomized play-the-winner rule, was a trial of extracorporeal membrane oxygenation to treat newborns with respiratory failure (Bartlett et al., 1985). More recent examples include a three-armed trial in untreated patients with adverse karyotype acute myeloid leukemia (Giles et al., 2003), which used BAR. The ongoing I-SPY 2 trial (Park et al., 2016; Rugo et al., 2016), which screens drugs in neoadjuvant breast cancer, also uses BAR as part of its design.
A key concern over using response-adaptive randomization, particularly from a regulatory perspective, is ensuring that the type I error rate is controlled. Indeed, draft regulatory guidance from the U.S. Food and Drug Administration (2010) includes adaptive randomization under a section entitled "Adaptive Study Designs Whose Properties Are Less Well Understood". It then goes on to state that "particular attention should be paid to avoiding bias and controlling the Type I error rate" (Food and Drug Administration, 2010, pg. 27) when using adaptive randomization in trials.
In a multi-arm trial, multiple hypotheses are tested simultaneously by design, which leads to a multiple testing problem. To account for this, testing procedures are used that guarantee strong control of the familywise error rate (FWER), which ensures the maximum probability of making at least one type I error is controlled. For confirmatory trials in particular, demonstrating strong control of the FWER is often required by regulators (Food and Drug Administration, 2010; European Medicines Agency, 2002).
For response-adaptive trials, a rigorous proof of FWER control for a particular design is difficult given the complexities of the treatment allocation process. Hence error control has typically either been demonstrated through simulation studies, or by exploiting the asymptotic structure of the adaptive randomization procedure (Hu and Rosenberger, 2006; Zhu and Hu, 2010). However, neither method provides a guarantee of FWER control, particularly with small sample sizes. Gutjahr et al. (2011) showed how to achieve strong control of the FWER for normally distributed outcomes in a two-stage design incorporating response-adaptive randomization. However, our focus is on general response-adaptive trials, without the necessity of restricting to two stages or having a final stage of equal randomization.
In this paper, we show how to guarantee strong control of the FWER for both fully sequential and block randomized response-adaptive trials, for a large class of adaptive randomization rules. Our proposed procedure works by reweighting the usual z-statistic through an iterative application of the conditional invariance principle. The resulting adaptive test statistic can then be used to test each elementary null hypothesis of no treatment benefit over the control.
The rest of the paper is organised as follows. In Section 2, we describe the proposed method for fully sequential response-adaptive trials with a fixed allocation to the control. This method is then modified for block randomized response-adaptive trials in Section 3, for both a fixed or adaptive control allocation. Simulation studies for the proposed methods are presented in Section 4, and Section 5 gives a case study based on a trial in primary hypercholesterolemia. We conclude with a discussion in Section 6. All proof details can be found in the Appendices.
2 Fully sequential response-adaptive trials

2.1 Trial setting
Suppose a trial is conducted to test h > 1 experimental treatments against a common control, using the following design. A total of n patients are allocated to the experimental treatments, and n_0 patients are allocated to the control, where n_0 and n are fixed in advance. Patients are allocated to the different experimental treatments using response-adaptive randomization, where we assume that the randomization rule does not depend on the control information. We also assume the allocation to the control is fixed; that is, the probability of assigning a patient to the control is pre-specified and constant. Maintaining allocation to the control is recommended by the Food and Drug Administration (2010), since it best maintains the power of the trial, and helps address the concern about changing patient characteristics over the course of the trial.
The response-adaptive randomization for the experimental treatments starts with a burn-in period B, which uses equal randomization to allocate r_i > 0 patients to the ith treatment (i = 1, . . . , h), with the r_i again fixed in advance. Hence a total of r = Σ_{i=1}^h r_i patients are allocated to the experimental treatments during the burn-in period. Let a_k denote the treatment allocation for the kth experimental patient (k = 1, . . . , n), where a_k = i if the kth patient is allocated to the ith treatment.
Also, let X_k denote the efficacy outcome for the kth patient, and let X_0j denote the efficacy outcome for the jth patient on the control (j = 1, . . . , n_0). We assume that, conditional on the allocations, the outcomes are independent, with X_k | a_k = i ∼ N(µ + δ_i, σ²) and X_0j ∼ N(µ, σ²), where µ is the mean response on the control. The variance σ² is assumed known and, without loss of generality, we set σ² = 1. Here δ_i represents the incremental benefit of treatment i compared to the control, and is the parameter of interest. Finally, let n_i denote the total number of allocations to the ith experimental treatment, including the burn-in period.
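To make the outcome model concrete, the following minimal sketch draws control and experimental responses with σ² = 1; the function and parameter names are our own, not from the paper.

```python
import numpy as np

def simulate_outcomes(alloc, delta, mu=0.0, n0=30, seed=None):
    """Draw outcomes under the assumed model (sigma^2 = 1):
    X_0j ~ N(mu, 1) for the control, and X_k ~ N(mu + delta_i, 1)
    when patient k is on experimental arm i (alloc[k-1] = a_k)."""
    rng = np.random.default_rng(seed)
    x0 = rng.normal(mu, 1.0, size=n0)  # control responses X_01, ..., X_0n0
    x = rng.normal(mu + np.array([delta[a - 1] for a in alloc]), 1.0)
    return x, x0

# four experimental patients, two on each arm, with delta = (0.5, 0)
x, x0 = simulate_outcomes([1, 1, 2, 2], delta=[0.5, 0.0], seed=0)
```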

2.2 Hypothesis testing
The elementary null hypotheses of interest are H_i : δ_i = 0, tested against the one-sided alternatives δ_i > 0.
We discuss the case where H_i : δ_i ≤ 0 at the end of Section 2.5. One general method to control for multiple testing is the closure principle (Marcus et al., 1976): for each non-empty subset I ⊆ {1, . . . , h}, the intersection hypothesis H_I = ∩_{i∈I} H_i is tested at level α, and H_i is rejected only if every H_I with i ∈ I is rejected. Using the usual z-test, H_I is rejected if the statistic T_I, the difference between the pooled sample mean of the responses on the treatments in I and the sample mean of the control responses, is greater than z_α (1/n_I + 1/n_0)^{1/2}, where n_I = Σ_{i∈I} n_i and z_α is the (1 − α) standard normal quantile.
As an alternative to using the closure principle with the test statistic above, we could control the FWER by simply using a Bonferroni correction, or a step-up/step-down procedure such as the Holm procedure. These would only involve calculating test statistics for the h elementary null hypotheses, i.e. calculating T I for I = {i} (i = 1, . . . , h). Hence we present the methodology assuming the closure principle will be used, with the Bonferroni and Holm procedures considered as special cases. We return to this issue in Section 4.
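As an illustration, the closure principle with the naive z-test can be sketched as follows; this is a hypothetical helper (all names are our own), assuming known variance σ² = 1 and the rejection rule above.

```python
import itertools
from statistics import NormalDist, mean

def closed_z_test(x, alloc, x0, h, alpha=0.05):
    """Closure principle with the naive z-test (known sigma^2 = 1):
    for each non-empty I, reject H_I when the pooled mean over the arms
    in I minus the control mean exceeds z_alpha * sqrt(1/n_I + 1/n_0);
    H_i is rejected iff every H_I with i in I is rejected."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    n0 = len(x0)
    rejected = {}
    for r in range(1, h + 1):
        for I in itertools.combinations(range(1, h + 1), r):
            resp = [xi for xi, a in zip(x, alloc) if a in I]
            t_I = mean(resp) - mean(x0)
            rejected[I] = t_I > z_alpha * (1 / len(resp) + 1 / n0) ** 0.5
    return [all(rej for I, rej in rejected.items() if i in I)
            for i in range(1, h + 1)]
```

The Bonferroni and Holm special cases follow by replacing the loop over all subsets with the h singletons I = {i}.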

2.3 Inflation of the familywise error rate
Since the z-test ignores the adaptive randomization used, it is possible for the FWER to be inflated. As an example, consider the following adaptive randomization scheme for h = 2 treatments: after the burn-in period, patient (k + 1) is allocated to treatment 2 if the current sample mean of the responses on treatment 1 exceeds 0.5, and to treatment 1 otherwise. This can be viewed as implementing early stopping for efficacy for treatment 1, which is not taken into account by the naïve z-test.
We ran a simulation study to calculate the type I error rate using the above randomization scheme. We subsequently refer to allocation rules of this type as 'type I error inflator' rules (which clearly would never be used in practice).
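A short Monte Carlo sketch of this phenomenon for h = 2, using the design parameters of Section 4.1 (a burn-in of five patients per arm, 50 further experimental allocations and n_0 = 30 controls), estimates the naive z-test's rejection rate for H_1 under the global null; the function name and structure are our own.

```python
import numpy as np

def one_trial(rng, n_extra=50, burn=5, n0=30, z_alpha=1.6449):
    """One simulated trial under the global null (all delta_i = 0, mu = 0).
    After the burn-in, each new patient goes to treatment 2 if the running
    mean on treatment 1 exceeds 0.5 ('early stopping for efficacy' on
    arm 1), and to treatment 1 otherwise."""
    x1 = list(rng.normal(0.0, 1.0, burn))
    x2 = list(rng.normal(0.0, 1.0, burn))
    for _ in range(n_extra):
        if np.mean(x1) > 0.5:
            x2.append(rng.normal())
        else:
            x1.append(rng.normal())
    x0 = rng.normal(0.0, 1.0, n0)  # control responses
    # naive one-sided z-test of H_1 at level 0.05, ignoring the adaptation
    t1 = np.mean(x1) - x0.mean()
    return t1 > z_alpha * np.sqrt(1 / len(x1) + 1 / n0)

rng = np.random.default_rng(1)
rate = np.mean([one_trial(rng) for _ in range(4000)])
print(round(rate, 3))  # Monte Carlo rejection rate for H_1; exceeds the nominal 0.05
```

Freezing arm 1 at a high running mean biases its final sample mean upwards, which is exactly the optional-stopping effect the naive test ignores.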

2.4 Auxiliary design
Working with the actual design of the trial is difficult, because the response-adaptive randomization affects the distribution of the usual z-test statistics. Hence for each H_I we introduce a simpler design, called the auxiliary design, for which the distribution is known. The actual trial design can then be viewed as a series of data-dependent modifications of the auxiliary design, where we account for the modifications using the conditional invariance principle. The auxiliary designs are purely hypothetical, and are only used to construct the modified tests for the actual design. In addition, the allocations in the auxiliary designs are fixed before the start of the actual trial.
The auxiliary design for hypothesis H_I is as follows. As in the actual design, a total of n patients are allocated to the experimental treatments, and n_0 patients are allocated to the control. The allocations and responses for the control are the same as in the actual design. For the patients allocated to the experimental treatments, the auxiliary design starts with a burn-in period B with r patients that is identical to the actual design. The subsequent n − r − 1 allocations are given by a fixed sequence (b_{r+1}, . . . , b_{n−1}), which can be chosen arbitrarily. These allocations can be considered as a 'guess' at a likely allocation sequence of the actual trial design. One possibility would be to randomize equal numbers of patients to each treatment. The final allocation b_n must be to one of the treatments in I.
We now introduce some notation for the auxiliary design. Let n_i = Σ_{j=1}^n 1{b_j = i} denote the total number of allocations to the ith experimental treatment. Also let m_{i,k} = Σ_{j=k}^n 1{b_j = i} denote the total number of allocations to the ith treatment for patients (k, k + 1, . . . , n). We define n_I = Σ_{i∈I} n_i and m_{I,k} = Σ_{i∈I} m_{i,k}. Under the auxiliary design, n_i is fixed for all i, and hence under H_I the usual z-statistic is normally distributed with mean zero and variance (1/n_I + 1/n_0). Hence we reject H_I if T_I is greater than z_α (1/n_I + 1/n_0)^{1/2}.
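These bookkeeping quantities are simple counts over the fixed sequence (b_1, . . . , b_n); for instance, a hypothetical helper using the 1-based indexing of the text:

```python
def aux_counts(b, i, k):
    """n_i: total allocations to arm i in the fixed sequence b = (b_1, ..., b_n);
    m_{i,k}: allocations to arm i among patients k, k+1, ..., n (1-based k)."""
    n_i = sum(1 for bj in b if bj == i)
    m_ik = sum(1 for bj in b[k - 1:] if bj == i)
    return n_i, m_ik
```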

2.5 Adaptive test statistic
Adaptive designs, such as the trial considered here, can control the type I error rate through a common conditional invariance principle (Brannath et al., 2007). For our response-adaptive trial, we apply the conditional invariance principle sequentially, where each step considers the next patient recruited into the trial. Below we give the test statistic for testing hypothesis H_I under the actual design, given that the allocation is fully sequential. The proof of Theorem 2.1 can be found in Appendix A.
Theorem 2.1. Under H_I, the adaptive test statistic T̃_I, which reweights the patient responses using weights w_k^{(I)}, with the control observations split into two groups of sizes m_{0,1} and m_{0,2} satisfying m_{0,1} + m_{0,2} = n_0, m_{0,1} > 0 and m_{0,2} > 0, is normally distributed with mean 0 and variance (1/n_I + 1/n_0). The explicit construction of T̃_I and its weights is given in Appendix A.

Hence we reject H_I if T̃_I is greater than z_α (1/n_I + 1/n_0)^{1/2}. In Appendix B, we give some simple numerical examples of how the weights change over the course of a trial. In practice, to keep the weights as close to the natural weight 1/n_0 for as many of the control observations as possible, we recommend setting m_{0,1} = n_0 − 1 and m_{0,2} = 1, as used for the simulation studies in Section 4.1.
In all of the scenarios that we have investigated, the weights w_k^{(I)} for the experimental treatments have been positive. Hence in these cases, the test procedure also controls the FWER for the composite null hypotheses H_i : δ_i ≤ 0. To see this, fix δ*_i ≤ 0 and suppose the elementary null hypotheses are H*_i : δ_i = δ*_i. Under H*_I, the responses on the treatments in I can be written as X*_k = X_k + δ*_{a_k}, where the X_k are distributed as under H_I. Since the weights are positive, T̃*_I is stochastically no larger than T̃_I, where T̃*_I and T̃_I denote the adaptive test statistics for H*_I and H_I respectively, and so the test remains level α.
3 Block randomized response-adaptive trials

3.1 Trial setting
It may not be feasible or desirable to randomize patients one-by-one in a fully sequential manner.
Instead, one can use block randomization, where after the burn-in period B, patients are adaptively randomized to the experimental treatments in blocks of sizes (d_1, . . . , d_J) over J stages, where r + Σ_{j=1}^J d_j = n. The randomization of the jth block depends on the data up to block (j − 1), as well as any external information available at the time. Defining d_0 = 0, let D_l = r + Σ_{j=0}^l d_j for l = 0, . . . , J, which represents the total number of allocations to the experimental treatments by the end of the lth block, with the zeroth block corresponding to the burn-in period. For notational convenience, we let D_{−1} = 0. The allocation to the control is again assumed to be fixed throughout the trial.
Due to the block structure of the trial, we can relax the assumption that the randomization rule used for the experimental treatments does not depend on the control information. This is achieved by splitting up the n 0 patients allocated to the control into blocks. More explicitly, suppose that during the burn-in period, r 0 > 0 patients are allocated to the control, where r 0 is fixed in advance.
Subsequently, in the jth block, d_{0j} patients are allocated to the control, where r_0 + Σ_{j=1}^J d_{0j} = n_0.
We assume that for the final block d 0J > 1.
The response-adaptive randomization at block l may now depend on the control information available at the end of block (l − 1); that is, the outcome data from the first r_0 + Σ_{j=1}^{l−1} d_{0j} patients allocated to the control. For notational convenience, define d_{00} = 0 and let D_{0,l} = r_0 + Σ_{j=0}^l d_{0j} (l = 0, . . . , J), which represents the total number of allocations to the control by the end of the lth block.
To control the FWER, we can modify the approach described in Section 2 to account for the block structure. As before, we have an auxiliary design for the patients on the experimental treatments, but now in step l of the process (l ∈ {1, . . . , J}) the actual design is a data-dependent modification of all the allocations for the patients in block l. Hence the weights for the observations in each block will be the same, and are updated block-by-block.

3.2 Auxiliary design and adaptive test statistic
The auxiliary design for an intersection hypothesis H I is the same as described in Section 2.4, except that we now impose a block structure on the auxiliary assignments to the experimental treatments.
As before, the auxiliary and actual designs are identical during the burn-in period B, and we require b_n ∈ I. For the auxiliary design, let n_i denote the total number of allocations to the ith treatment (i = 1, . . . , h), including the burn-in period. Also let m_{0,j} and m_{i,j} denote the total number of allocations to the control and to the ith treatment, respectively, for patients in blocks (j, j + 1, . . . , J). We define n_I = Σ_{i∈I} n_i and m_{I,j} = Σ_{i∈I} m_{i,j}.
We apply the conditional invariance principle block-by-block, where each step considers an additional block of patients recruited into the trial. This gives the following test statistic for testing H I , with a proof and the formulae for the weights given in Appendix C.
Theorem 3.1. If m_{I,J} > 0 then under H_I, the adaptive test statistic T̃_I, constructed by reweighting the responses block-by-block, is normally distributed with mean 0 and variance (1/n_I + 1/n_0).

Corollary 3.2. If m_{I,J} = 0, then let n_{0,J,1} + n_{0,J,2} = m_{0,J}, where n_{0,J,1}, n_{0,J,2} > 0. Under H_I, the corresponding adaptive test statistic T̃_I is again normally distributed with mean 0 and variance (1/n_I + 1/n_0).

We reject H_I if T̃_I is greater than z_α (1/n_I + 1/n_0)^{1/2}. In order to keep the weights as close to the natural weight 1/n_0 for as many of the control observations as possible, we recommend setting n_{0,J,1} = m_{0,J} − 1 and n_{0,J,2} = 1, as used for the simulation studies in Section 4.2. In all of the scenarios that we have investigated, the weights w_j^{(I)} for the experimental treatments have all been positive. Hence in these cases, the test procedure also controls the FWER for the composite null hypotheses H_i : δ_i ≤ 0.

3.3 Extension for adaptive control allocations
Thus far, we have assumed that the allocations to the control follow some fixed scheme. We now relax this assumption in the block-randomized setting. The form of the adaptive test statistic T̃_I is similar to that presented above, and the formula for T̃_I is given in Appendix D. Note that it is possible for the procedure to fail to give a valid test statistic in this setting, as shown in Appendix E.1.

4 Simulation studies
As we have already seen in Section 2.3, using the closure principle with the usual z-test does not strongly control the FWER. An alternative method of control is to use the Bonferroni correction on the elementary null hypotheses H_1, . . . , H_h. We also consider the Holm procedure, which is a step-down procedure that is uniformly more powerful than Bonferroni (Holm, 1979). An advantage of both these procedures is that only h test statistics are calculated, rather than the (2^h − 1) test statistics needed for the closure principle. This motivates also applying the Holm procedure to the p-values derived from the adaptive test statistics T̃_i for i = 1, . . . , h.

To distinguish between the different methods, we call our proposed procedure that uses the closure principle the 'adaptive closed test'. Similarly, applying the closure principle to the usual z-test gives the 'closed z-test'. Applying the Holm procedure to our adjusted p-values gives the 'Holm adaptive test', while applying the Holm procedure to the usual p-values gives the 'Holm z-test'. In our simulation studies, we compare the different methods primarily by looking at the FWER. However, another key consideration is the power of the different tests. To keep the comparisons simple, and as a similar measure to the FWER, we present results for the disjunctive power, which is the probability of rejecting at least one false null hypothesis.
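For reference, Holm's step-down adjustment of the h elementary p-values (applicable to either the usual or the adaptive p-values) can be sketched as follows; the helper name is our own.

```python
def holm_adjust(pvals):
    """Holm (1979) step-down adjusted p-values: the k-th smallest p-value
    is multiplied by (h - k + 1), then a running maximum enforces
    monotonicity; H_i is rejected when its adjusted p-value <= alpha."""
    h = len(pvals)
    order = sorted(range(h), key=lambda i: pvals[i])
    adj, running = [0.0] * h, 0.0
    for k, i in enumerate(order):
        running = max(running, min(1.0, (h - k) * pvals[i]))
        adj[i] = running
    return adj
```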

4.1 Fully sequential randomization
We first consider a fully sequential response-adaptive trial, as presented in Section 2, with m = 50 patients allocated to the experimental treatments after the burn-in and n 0 = 60/h patients allocated to the control. In the burn-in period, five patients are allocated to each of the experimental treatments.
We set α = 0.05 and the true control mean µ = 0 for simplicity. We compare the methods under two randomization schemes described below.
Type I error inflator: For h = 2 treatments, this is the same randomization scheme as presented in Section 2.3. For h = 3 treatments, if Σ_{j=1}^k 1{a_j = 1} X_j / n_{1,k} > 0.5, where n_{1,k} is the number of the first k patients allocated to treatment 1, then we randomize patient (k + 1) to treatments 2 and 3 with equal probability.

BAR: The efficacy outcome for the ith experimental treatment follows a N(µ_i, 1) distribution. For simplicity, we assign independent normal priors to the µ_i, so that µ_i ∼ N(µ_{i,0}, σ²_{i,0}). After observing the efficacy outcomes x = (x_1, . . . , x_K) for the first K patients, the posterior for µ_i is normal, with mean (µ_{i,0}/σ²_{i,0} + Σ_{k≤K : a_k = i} x_k) / (1/σ²_{i,0} + n_{i,K}) and variance (1/σ²_{i,0} + n_{i,K})^{−1}, where n_{i,K} is the number of the first K patients allocated to treatment i. We use a suggested BAR scheme of Yin et al. (2012). For h = 2 experimental treatments, the randomization probabilities (π_1, 1 − π_1) after observing the Kth patient are proportional to the posterior probabilities of each treatment being the better one, raised to the power τ; that is, π_1 ∝ Pr(µ_1 > µ_2 | x)^τ. For h > 2 experimental treatments, we first obtain the average of the posterior means, µ̄ = (1/h) Σ_{i=1}^h µ̂_i, and the randomization probabilities after observing the Kth patient are π_i ∝ Pr(µ_i > µ̄ | x)^τ. In our simulations, for simplicity we set the priors µ_{i,0} = 0 and σ²_{i,0} = 1, while τ = 0.5.
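Randomization probabilities of this type can be approximated by Monte Carlo sampling from the conjugate normal posteriors. The sketch below is an illustration under our assumptions only: h = 2, unit outcome variance, the stated priors, and a rule of the form π_1 ∝ Pr(µ_1 > µ_2 | x)^τ; all function names are hypothetical.

```python
import numpy as np

def posterior(x, mu0=0.0, sig2_0=1.0):
    """Conjugate normal update with known unit outcome variance:
    posterior precision = prior precision + n; mean is precision-weighted."""
    prec = 1.0 / sig2_0 + len(x)
    return (mu0 / sig2_0 + sum(x)) / prec, 1.0 / prec

def bar_probs(arm_data, tau=0.5, draws=10_000, rng=None):
    """pi_1 proportional to Pr(mu_1 > mu_2 | data)^tau, estimated by
    sampling from the two posteriors (h = 2 arms)."""
    rng = np.random.default_rng(rng)
    (m1, v1), (m2, v2) = posterior(arm_data[0]), posterior(arm_data[1])
    p1 = np.mean(rng.normal(m1, np.sqrt(v1), draws) >
                 rng.normal(m2, np.sqrt(v2), draws))
    w1, w2 = p1 ** tau, (1.0 - p1) ** tau
    return w1 / (w1 + w2), w2 / (w1 + w2)
```

Taking τ < 1 shrinks the probabilities towards equal randomization, which tempers how aggressively the scheme follows the observed data.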
Simulation results: Table 1 gives the results for the type I error inflator randomization scheme, while Table 2 gives the results for BAR. The auxiliary designs in all scenarios were simply (m − 1) random draws from a discrete uniform distribution on {1, . . . , h}.
Looking first at the results for the type I error inflator in Table 1, the closed z-test does not control the FWER in any of the scenarios where at least one null hypothesis is false, with an error rate as high as 10.3% in scenario 2. Applying the Holm procedure to the z-test does not control the FWER either, and actually increases the error rate in some scenarios (such as 1 and 4). Applying the Bonferroni correction to the z-test also fails to control the FWER, as can be seen in the scenarios where all null hypotheses are true. This may appear surprising at first, but the inflation occurs because the naïve z-test is not a valid level-α test for each elementary hypothesis. In contrast, both the adaptive closed test and the Holm adaptive test strongly control the FWER.

As for the power of the different methods, when at least one of the null hypotheses is true (as in scenarios 2, 5, 6 and 7), the Holm z-test has substantially higher power than the closed z-test.
Indeed, the power more than doubles in all four scenarios, and even more than triples in scenario 5.
This dramatic increase in power demonstrates that in these scenarios, the closed z-test is not very sensitive. This is because the test statistic for H_I is 'diluted' by the contribution from responses belonging to those null hypotheses H_i, i ∈ I, that are true. It is only when all of the null hypotheses are false, as in scenarios 3 and 8, that the power of the closed z-test is reasonable, with a slightly higher power than the Holm z-test.
As for the adaptive tests, the adaptive closed test has a slightly lower power than the closed z-test in all scenarios, with an absolute decrease of between 4.1% (scenario 5) and 7.5% (scenario 3). However, the Holm adaptive test has a substantially lower power than the Holm z-test, with the latter having more than double the power. This demonstrates the high cost in terms of power that controlling the FWER can incur for this randomization scheme. We return to this issue in Section 4.3.
Turning to the BAR scheme in Table 2, this time all of the methods strongly control the FWER.
All methods are slightly conservative, with the adaptive closed test being generally the closest to the nominal 5% level. The Bonferroni-corrected z-test is noticeably more conservative than all the other methods, particularly when there are three treatments. In terms of disjunctive power, if at least one of the null hypotheses is true, we again see that the closed tests suffer from reduced power compared to the Holm versions. However, with BAR the loss of power is less dramatic, with a maximum of a 33% relative decrease in power in scenario 5, but with much smaller decreases in scenarios 2 and 7 for example. This time, the adaptive closed test has almost the same power as the closed z-test, losing a maximum of only 1.4% in scenario 8. In addition, the Holm adaptive test and Holm z-test now have comparable power, with a maximum loss of only 1.9% in scenarios 6 and 7. This indicates that for BAR schemes, the adaptive tests do not lose out very much in terms of power.

4.2 Block randomization with a fixed control allocation
We now consider block randomized trials with a fixed control allocation, as presented in Section 3.1.
We use the setup of a trial with J = 3 blocks, with sizes (40, 40, 40) for the experimental treatments and (20, 20, 20) for the control. In the burn-in period, five patients are allocated to each of the treatments including the control. We set the true control mean µ = 0, and α = 0.05. We compare the methods under the randomization schemes below.
Type I error inflator: For block j ∈ {1, . . . , J − 1}, patients k = D_j + 1, . . . , D_{j+1} are randomized equally among treatments l ∈ {2, . . . , h} if the observed mean response on treatment 1 up to the end of block j exceeds 0.5, and are allocated to treatment 1 otherwise.

BAR: The efficacy outcome for the ith treatment follows a N(µ_i, 1) distribution. For notational convenience, let µ_0 = µ, the mean of the control. We assign independent normal priors to the µ_i (i = 0, 1, . . . , h), so that µ_i ∼ N(µ_{i,0}, σ²_{i,0}). At stage (j + 1), when the efficacy outcomes x = (x_1, . . . , x_{D_j}) have been observed, the posterior for µ_i is obtained from the same conjugate normal updating as in Section 4.1, based on the responses observed on arm i up to the end of block j. We use a similar BAR scheme to the one in Wason and Trippa (2014). If there are h experimental treatments, the randomization probabilities (π_1, . . . , π_h) for the experimental treatments at the (j + 1)th stage are π_i ∝ Pr(µ_i > µ_0 | x)^γ. In our simulations, for simplicity we set the priors µ_{i,0} = 0 and σ²_{i,0} = 1, while γ = 0.5.
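A block-level update of this kind can likewise be sketched by Monte Carlo, here assuming (our reading, not a verbatim formula) a rule of the form π_i ∝ Pr(µ_i > µ_0 | x)^γ with N(0, 1) priors and unit outcome variance; the helper names are our own.

```python
import numpy as np

def block_bar_probs(control, arms, gamma=0.5, draws=10_000, rng=None):
    """pi_i proportional to Pr(mu_i > mu_0 | data)^gamma, with conjugate
    N(0, 1) priors and known unit outcome variance, estimated by sampling."""
    rng = np.random.default_rng(rng)

    def post_draws(x):
        prec = 1.0 + len(x)  # prior precision 1 plus n observations
        return rng.normal(sum(x) / prec, np.sqrt(1.0 / prec), draws)

    d0 = post_draws(control)  # posterior draws for the control mean mu_0
    w = np.array([np.mean(post_draws(x) > d0) ** gamma for x in arms])
    return w / w.sum()
```

In a block design, these probabilities would be recomputed once per block rather than after every patient.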
Simulation results: Table 3 gives the results for the type I error inflator randomization scheme, while Table 4 gives the results for BAR. The auxiliary designs in all scenarios were simply random draws from a discrete uniform distribution on {1, . . . , h}.
The results are broadly similar to those for the fully sequential setting presented in Section 4.1.
For the type I error inflator, we see that the closed z-test does not control the FWER in general (as seen in scenarios 2, 6 and 7), and neither does applying the Holm procedure to the z-test. The Bonferroni-corrected z-test has an inflated FWER when all null hypotheses are true, as in scenarios 1 and 4. In contrast, the adaptive tests strongly control the FWER in all scenarios. However, again this comes at the cost of reduced power. There is a slight reduction in power between the closed z-test and the adaptive closed test, of between 3% and 4% in absolute terms. In scenarios where at least one null hypothesis is true, the Holm z-test has a much higher power than the Holm adaptive test, with the power more than doubling in these scenarios, and actually tripling in scenario 6.
As for the BAR scheme, all of the methods strongly control the FWER. This time, for some scenarios the adaptive closed test essentially attains the nominal 5% level, as in scenarios 2 and 6.
When there are three treatments, the Bonferroni-corrected z-test can again be overly conservative, as in scenarios 6 and 7. In contrast to the fully sequential setting, with block randomization we see that the adaptive tests actually have the highest power out of all the methods in all scenarios except scenario 2. When at least one null hypothesis is true, the Holm adaptive test has the highest power, while when all null hypotheses are false the adaptive closed test has the highest power. The power gains are small, but demonstrate that we do not always lose out in terms of power when using the proposed adaptive tests.
Block randomization with an adaptive control allocation: In Appendix E.1, we present a simulation study considering block randomization with an adaptive control allocation, as presented in Section 3.3. The results are broadly similar to those presented above.

4.3 Summary
In summary, the simulation results show that in the randomization settings considered, our proposed adaptive tests strongly control the FWER, as would be expected from the theory. In contrast, the various z-tests can all fail to control the error rate, as seen in the results for the type I error inflator. However, given a more realistic randomization scheme, such as the BAR schemes we considered, the z-tests achieved strong familywise error control. As for disjunctive power, we see that when at least one null hypothesis is true, the closed tests suffer a very large drop in power compared to the Holm versions. This is because of the 'dilution' of the test statistic mentioned in Section 4.1. However, when all the null hypotheses are false, the closed tests have the higher power, although the gains are at most modest.
The adaptive tests can pay a large price in terms of power when compared with the z-tests, as seen in the results for the type I error inflator. In Appendix E.2, we give an additional simulation study with two treatments, where the randomization scheme is simply a fixed allocation to the experimental treatments with unequal randomization probabilities. We show that when the probability of assignment to treatment 2 is low (i.e. less than 0.2), there is a large drop in the power of the adaptive tests for testing H_1. This explains what happens with the type I error inflator when δ_1 = 0: apart from the unlikely event that treatment 1 stops early for 'efficacy', the probability of assignment to treatment 2 is zero by design. Hence, the type I error inflator is in fact close to a worst-case scenario for the adaptive tests.
However, most adaptive randomization schemes are unlikely to have such extreme imbalances.
Indeed, authors such as Korn and Freidlin (2011) recommend restricting the probability of arm assignment to between 0.2 and 0.8 in order to prevent extreme patient allocation. Hence, for 'sensible' adaptive randomization schemes with such a restriction, we would not expect there to be a substantial loss of power when using the Holm adaptive test compared with the Holm z-test, particularly in the block randomized setting.
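As a concrete reference for the Holm procedures compared above, the step-down rule can be sketched in a few lines. This is a generic implementation: the p-values themselves would come from the adaptive tests or z-tests described in the text.

```python
# Holm's step-down procedure for strong familywise error control.
# p_values[i] is the p-value for elementary hypothesis H_i.

def holm(p_values, alpha=0.05):
    """Return the set of indices of rejected hypotheses."""
    h = len(p_values)
    order = sorted(range(h), key=lambda i: p_values[i])
    rejected = set()
    for step, i in enumerate(order):
        # Compare the (step+1)-th smallest p-value with alpha/(h - step).
        if p_values[i] <= alpha / (h - step):
            rejected.add(i)
        else:
            break  # stop at the first non-rejection
    return rejected
```

For h hypotheses, the smallest p-value is compared with α/h, the next with α/(h − 1), and so on, stopping at the first non-rejection; this step-down structure is what gives strong FWER control without the conservatism of a plain Bonferroni correction.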

Case study
Finally, we illustrate our proposed methodology using an example based on a phase II placebo-controlled trial in primary hypercholesterolemia (Roth et al., 2012). The purpose of the study was to compare the effects of adding the SAR236553 antibody to high-dose or low-dose atorvastatin, as compared with high-dose atorvastatin alone. The primary outcome was the least-squares mean percent reduction from baseline of low-density lipoprotein cholesterol (LDL-C). Patients were randomly assigned, in a 1:1:1 ratio, to receive 80 mg of atorvastatin plus placebo, 10 mg of atorvastatin plus SAR236553, or 80 mg of atorvastatin plus SAR236553. For convenience, we label these interventions as the 'control', 'low dose' and 'high dose' respectively.
In the trial, results were reported as the observed least-squares mean ± SE percent reduction from baseline in LDL-C for each arm. We use the BAR scheme of Section 4.2, with priors µ_{i,0} = 5 and σ²_{i,0} = 1 (i = 0, 1, 2), and γ = 0.5. Table 5 shows the results for a simulated trial with the above parameters, where the BAR scheme allocated 13 patients to the low dose and 32 patients to the high dose after the burn-in period. This yields the natural weights used in the naïve z-test of n_1 = 21 for the low dose and n_2 = 40 for the high dose; the natural weight for the control is n_0 = 31 by design. The auxiliary design randomly assigned 44 patients to the low or high dose in a 1:1 ratio, allocating 21 patients to the low dose and 23 patients to the high dose. The adaptive test statistic is slightly smaller than the z-test statistic for the low dose, while the converse holds for the high dose. Looking at the adaptive weights for the burn-in period and the three blocks, we see that the weights for the low dose decrease with each block while the corresponding control weights increase; this pattern is reversed for the high dose. Given that all the p-values are less than 0.001, using either the z-test or the adaptive test we would conclude that adding the SAR236553 antibody to high-dose or low-dose atorvastatin leads to a statistically significant reduction in LDL-C levels.
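For reference, the naïve z statistic underlying Table 5 compares each arm mean with the control mean using the natural weights (n_0 = 31, n_1 = 21, n_2 = 40 above). The sketch below assumes a known variance σ² = 1 and uses placeholder means, since the observed trial values are not reproduced here.

```python
import math

# Naive z statistic for comparing arm i with the control. The sample
# sizes are the natural weights from the case study; the mean arguments
# are placeholders for illustration, NOT the trial's observed values.

def z_stat(mean_i, mean_0, n_i, n_0, sigma2=1.0):
    return (mean_i - mean_0) / math.sqrt(sigma2 * (1.0 / n_i + 1.0 / n_0))
```

For example, a (hypothetical) mean difference of 1 between the high dose (n_2 = 40) and the control (n_0 = 31) would give z ≈ 4.18.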

Discussion
A major regulatory concern over the use of response-adaptive trials in clinical practice has been ensuring control of the type I error rate. We have proposed procedures that guarantee strong familywise error control in the following multi-armed trial settings:
1. Fully sequential response-adaptive trials with a fixed control allocation (where the randomization rule does not depend on the control information)
2. Block-randomized response-adaptive trials with a fixed control allocation
3. Block-randomized response-adaptive trials including an adaptive control allocation
These procedures are applicable to a large class of response-adaptive randomization rules, particularly in settings (2) and (3), where there are no restrictions on the rule used. Hence both Bayesian and 'optimal' response-adaptive randomization schemes proposed in the literature can be used without adjustment, with only the final test statistic having to be modified.
In practice, to control the FWER we would recommend using the Holm adaptive test. Importantly, it has much higher power than the adaptive closed test when at least one of the null hypotheses is true. Moreover, it requires only h hypothesis tests, compared with (2^h − 1) for the adaptive closed test.
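The contrast between the two procedures can be made concrete. A generic closed testing procedure rejects an elementary hypothesis H_i only if every intersection hypothesis containing i is rejected by its local level-α test, which requires up to (2^h − 1) local tests. The sketch below treats the local test as an abstract callable standing in for the adaptive test of H_I.

```python
from itertools import combinations

# Closed testing: reject elementary H_i iff every intersection hypothesis
# H_I with i in I is rejected by its local test. 'local_test' is an
# abstract stand-in returning True if H_I is rejected at level alpha.

def closed_test(h, local_test):
    subsets = [set(I) for r in range(1, h + 1)
               for I in combinations(range(h), r)]  # all 2^h - 1 intersections
    rejected_I = {frozenset(I) for I in subsets if local_test(I)}
    return {i for i in range(h)
            if all(frozenset(I) in rejected_I for I in subsets if i in I)}
```

With h = 10, the closed test needs up to 1023 local tests, whereas Holm needs at most 10, which is the practical advantage noted above.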
Our adaptive tests lead to unequal weightings of patients, which may be controversial (Burman and Sonesson, 2006). One solution is to use the so-called 'dual test', and reject a hypothesis only if both the adaptive test and the naïve z-test reject (Denne, 2001; Posch et al., 2003; Chen et al., 2004), although this comes at the cost of reduced power.
We have assumed that the variances of the control and experimental treatments are known.
Fully accounting for unknown variances would add considerable complexity to our approach. In Appendix E.3, we show that estimating the common variance from the data does not inflate the FWER when using the Holm adaptive test, for any of the simulation scenarios considered in this paper.
Our proposed procedures are designed for normally distributed outcomes, and it would be useful to extend our approach to binary outcomes as well. As a starting point, it may be possible to use the asymptotically normal test statistic for contrasting each treatment arm with the control (Jennison and Turnbull, 2000; Wason and Trippa, 2014), particularly in the block randomized setting.
Finally, although we did not explicitly consider it in this paper, the adaptive randomization procedures used could also incorporate covariate information, so that the allocation probabilities vary across patients with different covariates. These covariate-adjusted response-adaptive randomization schemes are particularly useful when certain characteristics of the patients may be correlated with the primary outcome (Hu and Rosenberger, 2006). A related setting would be biomarker-guided response-adaptive trials, such as I-SPY 2.
Appendix A: Derivation of the weights for familywise error control in fully sequential response-adaptive trials

Below is a diagrammatic representation of the assignments and observations for the auxiliary design compared to the actual design for the patients on the experimental treatments:

Actual design:     a_1 ⋯ a_r | a_{r+1} a_{r+2} ⋯ a_n
                   X_1 ⋯ X_r | X_{r+1} X_{r+2} ⋯ X_n

Auxiliary design:  b_1 ⋯ b_r | b_{r+1} b_{r+2} ⋯ b_n
                   Y_1 ⋯ Y_r | Y_{r+1} Y_{r+2} ⋯ Y_n

where the vertical line marks the end of the burn-in period B, b_k = a_k, Y_k = X_k (k = 1, . . . , r) and b_n ∈ I by design.
Step 1: In step 1 we only consider the first response-adaptive allocation a_{r+1}. We view the auxiliary and actual trials as coming from a two-stage design, where the first stage for both is the burn-in period B, as shown below.
Auxiliary design (step 1):
  Stage 1: a_1 ⋯ a_r   Stage 2: b_{r+1} b_{r+2} ⋯ b_n
           X_1 ⋯ X_r            Y_{r+1} Y_{r+2} ⋯ Y_n

Actual design (step 1):
  Stage 1: a_1 ⋯ a_r   Modified stage 2: a_{r+1} b_{r+2} ⋯ b_n
           X_1 ⋯ X_r                     X_{r+1} Y_{r+2} ⋯ Y_n

Given the interim data from the burn-in period B, we can determine the actual allocation a_{r+1}; indeed, a_{r+1} needs to be known in order for the trial to continue. Hence the second stage for the actual design in step 1 is a data-dependent modification of the auxiliary design, where the allocation b_{r+1} is set to a_{r+1}. At this step, all other allocations for the actual design remain the same as in the auxiliary design. The modification to the second stage in the actual design can only depend on data available at the end of the interim stage, that is, the burn-in period. Hence when considering a fully sequential response-adaptive scheme, we cannot adapt b_k to a_k for k > r + 1 at this step, and can only consider the modification of b_{r+1} to a_{r+1}.
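The logic of this step can be checked numerically. The following sketch (a simplified two-stage example, not the paper's exact statistics) shows that if the modified second-stage statistic is reweighted to match the conditional mean and variance of the pre-planned one, the combined statistic retains its pre-planned null distribution despite the data-dependent adaptation.

```python
import random
import statistics

random.seed(1)

# Illustration of the conditional invariance principle under the null.
# The pre-planned second-stage statistic is N(0, V2) given stage 1. A
# data-dependent modification changes its conditional variance to v_mod;
# reweighting by sqrt(V2 / v_mod) restores the conditional distribution,
# so the combined statistic keeps its pre-planned N(0, V1 + V2) law.

V1, V2 = 1.0, 1.0
combined = []
for _ in range(100_000):
    t1 = random.gauss(0.0, V1 ** 0.5)        # first-stage statistic
    v_mod = 0.25 if t1 > 0 else 4.0          # adaptation driven by stage-1 data
    s = random.gauss(0.0, v_mod ** 0.5)      # modified second-stage statistic
    t2_tilde = (V2 / v_mod) ** 0.5 * s       # match conditional mean and variance
    combined.append(t1 + t2_tilde)
```

Empirically, the mean and variance of `combined` are close to 0 and V1 + V2 = 2, even though the second-stage sampling depended on the first stage.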
Under the auxiliary two-stage design, the test statistic T_I = T_I^{(1)} + T_I^{(2)} for the experimental treatments is decomposed into two parts, where T_I^{(1)} is calculated from the first-stage data and T_I^{(2)} from the second-stage data. Since the control data are independent of the design adaptations, we can consider these data as coming from the second stage of both the actual and auxiliary designs.
Using the conditional invariance principle, we seek a statistic T̃_I^{(2)} that has the same distribution as T_I^{(2)} conditional on the interim data. To match the conditional distributions, we equate the conditional means and variances, where λ_{r+1} = m_{I,r+1}/n_I − 1 and η_{r+1} = m_{I,r+1}/(n_I)² + 1/n_0. The full modified statistic for the actual design in step 1 then follows, where we define w_k^{(I)} = n_I (k = 1, . . . , r), and the (r + 1) subscript on T̃_{I,r+1} indicates that this is the modified test statistic for the actual design after the first (r + 1) patients. By the conditional invariance principle, T̃_{I,r+1} is a valid test statistic for the actual design.
Step 2: In step 2, we take the actual design from step 1 as the new auxiliary design. This means that the modified test statistic T̃_{I,r+1}, as defined above, is also taken forward from step 1 and is the valid test statistic for the new auxiliary design. We again view the auxiliary and actual trials as two-stage designs, where this time the first stage is the data from the first (r + 1) patients, as shown below.

Actual design (step 2):
  Stage 1: a_1 ⋯ a_r a_{r+1}   Modified stage 2: a_{r+2} b_{r+3} ⋯ b_n

Here the second stage for the new actual design is a modification of the new auxiliary design where the allocation b_{r+2} is set to a_{r+2}.
Under the auxiliary two-stage design, the test statistic T̃_{I,r+1} for the experimental treatments is decomposed into the statistics calculated from the first- and second-stage data, T̃_{I,r+1} = T̃_{I,r+1}^{(1)} + T̃_{I,r+1}^{(2)}. Following the conditional invariance principle as in step 1, we seek weights such that the modified second-stage statistic has the same distribution as T̃_{I,r+1}^{(2)} conditional on the interim data D^{(1)}. Matching the conditional distributions by equating the conditional means and variances gives the weights in terms of λ_{r+2} and η_{r+2}. The resulting full modified statistic for the actual design in step 2 is taken forward to the next step of the process as the valid test statistic for the new auxiliary design.

Inductive step
We now repeat the process above, at each step taking forward the actual design as the new auxiliary design. The actual design at step l of the process (l = 2, . . . , n − r − 1) is a modification of the new auxiliary design where b_{r+l} is set to a_{r+l}. The valid test statistic for the new auxiliary design is T̃_{I,r+l−1}, taken forward from the previous step of the process; we provide an explicit expression for the test statistics shortly. The diagrammatic representation of step l of the process is given below.

Auxiliary design (step l)
Stage 1: a_1 ⋯ a_r a_{r+1} a_{r+2} ⋯ a_{r+l−1}

Actual design (step l):
  Stage 1: a_1 ⋯ a_r a_{r+1} a_{r+2} ⋯ a_{r+l−1}   Modified stage 2: a_{r+l} b_{r+l+1} ⋯ b_n

Using these auxiliary and actual designs, we select new weights as in the previous steps. The corresponding test statistics T̃_{I,r+k} for k = 1, . . . , l follow in the same way.

Final step
In the final step of the process, the second-stage data for the auxiliary and actual designs is a single allocation:

Auxiliary design (final step): a_1 ⋯ a_r a_{r+1} a_{r+2} ⋯ a_{n−1} b_n
                               X_1 ⋯ X_r X_{r+1} X_{r+2} ⋯ X_{n−1} Y_n

Actual design (final step):    a_1 ⋯ a_r a_{r+1} a_{r+2} ⋯ a_{n−1} a_n
                               X_1 ⋯ X_r X_{r+1} X_{r+2} ⋯ X_{n−1} X_n

Under the auxiliary design, the test statistic T̃_{I,n−1} is decomposed into first- and second-stage statistics, since b_n ∈ I by design. However, if a_n ∉ I, then we do not have enough degrees of freedom with a single weight to match both the conditional means and variances, and so we seek a modified second-stage statistic that has the same distribution as T̃^{(2)}_{I,n} conditional on the interim data D^{(1)}. Under H_I, equating the conditional means and variances yields the weights, with η_n defined in terms of the control weights. For notational convenience, we define separate weight functions for the cases a_n ∈ I and a_n ∉ I, and express the weights w^{(0)}_{n,1} and w^{(0)}_{n,2} for the controls in either case. The final test statistic for testing hypothesis H_I is then T̃_I, and we reject H_I if T̃_I is greater than z_α(1/n_I + 1/n_0)^{1/2}.

Appendix B: Numerical example for fully sequential response-adaptive trials

As a simple illustration of how the weights change over the course of a trial, consider testing h = 2 experimental treatments. We set α = 0.05, n_0 = 10, n = 11 and r_1 = r_2 = 1. Suppose we have no a priori reason to favour one treatment over the other, and so we simply choose the auxiliary design to be an equal randomization of the two treatments:

b = 1 2 | 2 1 2 2 1 1 2 1 *

Here the vertical line indicates where the burn-in period ends, and the * represents the allocation b_n, which by design must satisfy b_n ∈ I. We set m_{0,1} = 9 and m_{0,2} = 1. Below are the weights w^{(1)}, w^{(2)} and w^{(0)} for a variety of actual allocations.

Table 6: An actual allocation a that is almost the same as the auxiliary design b. The weights that would be used in the naïve z-test are n_0 = 10, n_1 = 4 and n_2 = 7.
a = 1 2 | 2 2 1 2 2 1 2 1 2
b = 1 2 | 2 1 2 2 1 1 2 1 *
w^{(1)} = 6 6 6 5.

Table 7: An actual allocation a that is the opposite of the auxiliary design b after the burn-in period. The weights that would be used in the naïve z-test are n_0 = 10, n_1 = 6 and n_2 = 5.

Table 8: An extreme actual allocation a that is equal to 1 after the burn-in period. The weights that would be used in the naïve z-test are n_0 = 10, n_1 = 10 and n_2 = 2.

Table 9: An extreme actual allocation a that is equal to 2 after the burn-in period. The weights that would be used in the naïve z-test are n_0 = 10, n_1 = 1 and n_2 = 10.

a = 1 2 | 2 2 2 2 2 2 2 2 2
b = 1 2 | 2 1 2 2 1 1 2 1 *
w^{(1)} = 6 6 6 5.16 5.16 5.16 4.28 3.33 3.33 2.23 −

For block randomization, the diagrammatic representation of the actual and auxiliary designs is:

Actual design:    a_B a_1 a_2 ⋯ a_J
                  X_B X_1 X_2 ⋯ X_J

Auxiliary design: a_B b_1 b_2 ⋯ b_J
                  X_B Y_1 Y_2 ⋯ Y_J

Here a_B = (a_1, . . . , a_r) and X_B = (X_1, . . . , X_r) refer to the burn-in period B, while a_j = (a_{D_{j−1}+1}, . . . , a_{D_j}) and X_j = (X_{D_{j−1}+1}, . . . , X_{D_j}) represent the response-adaptive allocations and observations in block j (j = 1, . . . , J).
As before, we require b n ∈ I.
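Returning to the numerical example of Appendix B, the natural weights quoted in the captions of Tables 6 to 9 are simply the arm-wise allocation counts; for instance, for the actual allocation of Table 6:

```python
# Counting the natural weights used by the naive z-test from an actual
# allocation sequence, as in Tables 6-9. The control weight n_0 = 10 is
# fixed by design and is not derived from this sequence.

def natural_weights(a, h=2):
    return {arm: a.count(arm) for arm in range(1, h + 1)}

a = [1, 2, 2, 2, 1, 2, 2, 1, 2, 1, 2]   # actual allocation from Table 6
# natural_weights(a) recovers n_1 = 4 and n_2 = 7 from the caption
```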
Step 1: In step 1 we only consider the response-adaptive allocations for the first block a_1. We view the auxiliary and actual trials as coming from a two-stage design, where the first stage for both is the burn-in period B, as shown below.
Auxiliary design (step 1):
  Stage 1: a_B   Stage 2: b_1 b_2 ⋯ b_J

Given the interim data from B, we can determine the actual allocations a_1 for the first block. Hence the second stage for the actual design in step 1 is a data-dependent modification of the auxiliary design, where the allocations b_1 are set to a_1. All the other allocations for the actual design remain the same as in the auxiliary design.
Under the auxiliary two-stage design, the test statistic T_I is decomposed into two parts, where T_I^{(1)} is calculated from the first-stage data and T_I^{(2)} from the second-stage data. Following the conditional invariance principle, we select weights so that the conditional means and variances match, where λ_1 = m_{I,1}/n_I − m_{0,1}/n_0 and η_1 = m_{I,1}/(n_I)² + m_{0,1}/(n_0)².
Hence the full modified statistic for the actual design in step 1 is

Inductive step
We continue the process above, at each step taking forward the actual design as the new auxiliary design. The actual design at step l of the process (l ∈ {1, . . . , J − 1}) is a modification of the new auxiliary design where the allocations b_l are set to a_l. The valid test statistic for the new auxiliary design is T̃_{I,l}, taken forward from the previous step of the process. The diagrammatic representation of step l of the process is given below:

Auxiliary design (step l):
  Stage 1: a_B a_1 ⋯ a_{l−1}

Using these auxiliary and actual designs, we select new weights as in step 1. The corresponding test statistic T̃_{I,l} then follows, where we define D_{0,−1} = 0.

Final step
In the final step of the process, the second-stage data for the auxiliary and actual designs is the final block:

Auxiliary design (final step):
  Stage 1: a_B a_1 ⋯ a_{J−1}   Stage 2: a_J
           X_B X_1 ⋯ X_{J−1}            X_J

Under the auxiliary design, the test statistic T̃_{I,J−1} is decomposed into first- and second-stage statistics, where b_n ∈ I by design. The final test statistic for testing hypothesis H_I then follows, and we reject H_I if T̃_I is greater than z_α(1/n_I + 1/n_0)^{1/2}. However, if m_{I,J} = 0 then we do not have enough degrees of freedom with a single weight w^{(0)}_J to match both the conditional means and variances. In this case, since by design m_{0,J} = d_{0J} > 1, we consider separately the first n_{0,J,1} control observations and the next n_{0,J,2} control observations, where n_{0,J,1} > 0, n_{0,J,2} > 0 and n_{0,J,1} + n_{0,J,2} = d_{0J}. In order to keep the weights as close to the natural weight n_0 for as many of the control observations as possible, we recommend setting n_{0,J,1} = d_{0J} − 1 and n_{0,J,2} = 1, which is what we use for the simulation studies in Section 4.3 of the paper.
Letting D_{0,J,1} = D_{0,J−1} + n_{0,J,1}, we select the corresponding weights for the two groups of control observations. The final test statistic for testing hypothesis H_I then follows, and we reject H_I if T̃_I is greater than z_α(1/n_I + 1/n_0)^{1/2}.
The trial starts with a burn-in period B, which allocates r_0 > 0 patients to the control and r_i > 0 patients to the ith treatment (i = 1, . . . , h), where r_0 and the r_i are again fixed in advance. Hence a total of r = Σ_{i=0}^{h} r_i patients are allocated during the burn-in period.
The auxiliary design for hypothesis H_I starts with a burn-in period B of r patients that is identical to the actual design. The subsequent n − r − 2 allocations are given by a fixed sequence (b_{r+1}, . . . , b_{n−2}). The allocation b_{n−1} is to the control, while the allocation b_n must be in I. For the auxiliary design, let n_0 and n_i denote the total number of allocations to the control and the ith treatment respectively (i = 1, . . . , h), including the burn-in period.
Step 1: In step 1 we only consider the response-adaptive allocations for the first block a_1. We view the auxiliary and actual trials as coming from a two-stage design, where the first stage for both is the burn-in period B, as shown below.
Auxiliary design (step 1):

Under the auxiliary two-stage design, the test statistic T_I = T_I^{(1)} + T_I^{(2)} for the experimental treatments is decomposed into two parts, where T_I^{(1)} is calculated from the first-stage data and T_I^{(2)} from the second-stage data. We now select weights so that the modified second-stage statistic has the same distribution as T_I^{(2)} conditional on the interim data. To match the conditional distributions, we equate the conditional means and variances, where λ_1 = m_{I,1}/n_I − m_{0,1}/n_0 and η_1 = m_{I,1}/(n_I)² + m_{0,1}/(n_0)².
Hence the full modified statistic for the actual design in step 1 is

Inductive step
We now repeat the process above, at each step taking forward the actual design as the new auxiliary design. The actual design at step l of the process (l ∈ {1, . . . , J − 1}) is a modification of the new auxiliary design where the allocations b_l are set to a_l. The valid test statistic for the new auxiliary design is T̃_{I,l}, taken forward from the previous step of the process. The diagrammatic representation of step l of the process is given below.

Auxiliary design (step l)
Stage 1: a_B a_1 ⋯ a_{l−1}

We reject H_I if T̃_I is greater than z_α(1/n_I + 1/n_0)^{1/2}. However, if m_{I,J} = 0 and m_{0,J} > 1, then we do not have enough degrees of freedom with a single weight w^{(0)}_J to match both the conditional means and variances. In this case, we consider separately the first n_{0,J,1} control observations and the next n_{0,J,2} control observations, where n_{0,J,1} > 0, n_{0,J,2} > 0 and n_{0,J,1} + n_{0,J,2} = m_{0,J}. As before, we recommend setting n_{0,J,1} = m_{0,J} − 1 and n_{0,J,2} = 1, which is what we use for the simulation studies in Section E.1.
Suppose the (D_{0,J,1})th patient receives the (n_{0,J,1})th allocation to the control in block J. We then select the corresponding weights for the two groups of control observations.

E.1 Block randomization with an adaptive allocation to the control
We consider block randomization with an adaptive control allocation, as presented in Section 3.4 in the paper. We use the setup of a trial with J = 3 blocks and sizes (50,50,50). In the burn-in period, 5 patients are allocated to each of the treatments including the control. We again set the true control mean µ = 0, and α = 0.05.
Type I error inflator: The allocation probabilities are specified for block j ∈ {1, . . . , J − 1}, patient k = D_j + 1, . . . , D_{j+1} and treatment l ∈ {0, 2, . . . , h}.

Bayesian adaptive randomization: The priors and posteriors are the same as in Section 4.3 in the paper, and we use a similar Bayesian adaptive randomization scheme. If there are h experimental treatments, then the randomization probabilities (π_0, π_1, . . . , π_h) at the (j + 1)th stage are defined in terms of m̃_{ij}, the current arm-specific sample size for the ith treatment at the end of the jth stage. In our simulations, for simplicity we set the priors µ_{i,0} = 0 and σ²_{i,0} = 1, while γ = 0.5 and ν = 0.1.
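As an illustration of how such BAR probabilities can be computed, the sketch below uses a common rule of this general type (an assumption; the paper's exact formula, including the ν stabilization term, is not reproduced): each arm receives a probability proportional to the posterior probability that it is the best arm, raised to the power γ, with conjugate normal-normal posteriors for the arm means.

```python
import random

random.seed(0)

# Sketch of a common BAR-type rule (NOT the paper's exact scheme):
# pi_i proportional to P(arm i is best)^gamma, with P(best) estimated
# by Monte Carlo draws from each arm's normal-normal posterior.

def posterior_params(mu0, sigma2_0, xs, sigma2=1.0):
    """Conjugate normal-normal update for a known-variance likelihood."""
    n = len(xs)
    prec = 1.0 / sigma2_0 + n / sigma2
    mean = (mu0 / sigma2_0 + sum(xs) / sigma2) / prec
    return mean, 1.0 / prec

def bar_probabilities(data_by_arm, mu0=0.0, sigma2_0=1.0, gamma=0.5, draws=5000):
    params = [posterior_params(mu0, sigma2_0, xs) for xs in data_by_arm]
    wins = [0] * len(params)
    for _ in range(draws):
        sample = [random.gauss(m, v ** 0.5) for m, v in params]
        wins[sample.index(max(sample))] += 1   # which arm looks best this draw
    weights = [(w / draws) ** gamma for w in wins]
    total = sum(weights)
    return [w / total for w in weights]
```

The exponent γ < 1 flattens the allocation towards equal randomization, which is one way of avoiding the extreme allocation probabilities discussed in the Summary.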
Simulation results: Table 10 gives the results for the type I error inflator randomization scheme, while Table 11 gives the result for BAR. For each scenario, we ran 10 5 simulated trials. The auxiliary designs in all scenarios were random draws from a discrete uniform distribution on {0, 1, . . . , h}.
The results here are again broadly similar to those for the fully sequential setting, and the block randomization setting with a fixed allocation to the control. For the type I error inflator, the various z-tests do not strongly control the FWER. The adaptive tests do achieve strong error control, but this comes at the cost of a very large decrease in power when compared with the Holm z-test.
For the BAR scheme, again all methods strongly control the FWER. We also considered how often at least one imaginary or negative weight occurs over the 10^5 simulations for the two randomization schemes. Tables 12 and 13 give the percentage of simulations where the weights for the experimental treatments are imaginary or negative, for the type I error inflator and BAR scheme respectively.

E.2 Power of the adaptive test
The adaptive tests can pay a large price in terms of power when compared with the z-tests, as seen in the results for the type I error inflator. In order to understand what is happening in this setting, we conducted an additional simulation study. Suppose we are testing h = 2 treatments, and that the randomization scheme used is simply a fixed allocation to the experimental treatments, but with unequal randomization probabilities. Let p_2 denote the probability of assignment to treatment 2.
First, consider the fully sequential trial setup of Section 4.2 in the paper, with δ_1 = 0 and δ_2 = 0.7. Figure 1 shows how the power of the Holm adaptive test and z-test compare as p_2 varies. We see that when p_2 > 0.5, the adaptive test suffers only a small loss of power compared with the z-test. However, when p_2 < 0.5, the adaptive test loses an increasing amount of power.
Now consider the block randomization setup of Section 4.3 in the paper, with δ_1 = 0 and δ_2 = 0.5. Figure 2 shows that this time, the power of the adaptive test is very close, or even equal, to that of the z-test when p_2 > 1/3. This shows that the adaptive test is more robust in terms of power in the block randomization setting compared with the fully sequential version. Figure 3 shows how the powers differ for δ_1 = 0, δ_2 = 1 and p_2 < 0.2. We can see that when p_2 < 0.15, there is a noticeable and increasing divergence between the powers of the two tests. Indeed, when p_2 = 0 the power of the Holm z-test is three times that of the Holm adaptive test. This shows what is happening with the type I error inflator when δ_1 = 0, where in the majority of trial scenarios, apart from the unlikely event that treatment 1 stops early for 'efficacy', p_2 = 0 by design. Hence, the type I error inflator is in fact close to a worst-case scenario for the adaptive tests.

Figure 1: Power of the Holm adaptive test and Holm z-test as a function of the probability of assignment to treatment 2. We use the fully sequential trial setup of Section 4.2 in the paper, with h = 2 treatments and δ_1 = 0, δ_2 = 0.7.

Figure 2: Power of the Holm adaptive test and Holm z-test as a function of the probability of assignment to treatment 2. We use the block randomized trial setup of Section 4.3 in the paper, with h = 2 treatments and δ_1 = 0, δ_2 = 0.5.

Figure 3: Power of the Holm adaptive test and Holm z-test as a function of the probability of assignment to treatment 2. We use the block randomized trial setup of Section 4.3 in the paper, with h = 2 treatments and δ_1 = 0, δ_2 = 1.
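The Holm z-test curves in Figures 1 to 3 can be approximated by direct simulation. The sketch below implements only the z-test arm of the comparison under a fixed unequal allocation with known variance 1; the sample sizes are illustrative assumptions, and the adaptive test itself is not reproduced.

```python
import random
from statistics import NormalDist

random.seed(2)

# Disjunctive power of the one-sided Holm z-test with h = 2 treatments,
# delta_1 = 0 and delta_2 = 0.7, under a fixed allocation where treatment 2
# receives a fraction p2 of the n_exp experimental patients. The sample
# sizes n_exp and n0 are illustrative, not the paper's configuration.

def disjunctive_power(p2, n_exp=100, n0=50, deltas=(0.0, 0.7),
                      alpha=0.05, reps=20_000):
    n2 = max(1, round(p2 * n_exp))
    n1 = max(1, n_exp - n2)
    c1 = NormalDist().inv_cdf(1 - alpha / 2)   # Holm: smallest p vs alpha/2
    hits = 0
    for _ in range(reps):
        xbar0 = random.gauss(0.0, (1 / n0) ** 0.5)   # shared control mean
        zs = [(random.gauss(d, (1 / n) ** 0.5) - xbar0) / (1 / n + 1 / n0) ** 0.5
              for d, n in zip(deltas, (n1, n2))]
        if max(zs) > c1:   # at least one rejection: disjunctive power event
            hits += 1
    return hits / reps
```

Shrinking p2 shrinks n2 and hence the power of the z-test itself; the figures show that the adaptive test degrades even faster in this regime.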

E.3 Using the pooled sample variance
In this section, we assume that σ² is unknown and will be estimated at the end of the trial using the pooled sample variance σ̂² (as defined below) of the experimental treatments and control. More precisely, given the adaptive test statistic T̃_I, to test hypothesis H_I we compare T̃_I/σ̂ with the critical value z_α(1/n_I + 1/n_0)^{1/2}. We rerun all the simulation studies in Section 4 of the paper and Appendix E.1 with exactly the same setup, except for this change in the test procedure. Due to the extra variability induced by estimating the pooled sample variance, we simulate 10^6 trials for each set of parameter values.
Given that there are h experimental treatments, the pooled sample variance is

σ̂² = Σ_{i=0}^{h} (n_i − 1) s_i² / Σ_{i=0}^{h} (n_i − 1),

where s_i² is the sample variance for treatment i (i = 0, 1, . . . , h).
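A sketch of the degrees-of-freedom-weighted pooled estimator (the standard definition of the pooled sample variance, which we assume here) is:

```python
# Pooled sample variance across the control and h experimental arms,
# weighting each arm's sample variance by its degrees of freedom.

def pooled_variance(samples_by_arm):
    num = 0.0
    den = 0
    for xs in samples_by_arm:
        n = len(xs)
        mean = sum(xs) / n
        s2 = sum((x - mean) ** 2 for x in xs) / (n - 1)   # arm sample variance
        num += (n - 1) * s2
        den += n - 1
    return num / den
```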
For an adaptive control allocation, the pooled sample variance is defined analogously.

Table 14 gives the results for the type I error inflator, and Table 15 the results for BAR. Compared to assuming a known variance of σ² = 1, for all sets of parameter values both the FWER and disjunctive power increase slightly. For the type I error inflator, the Bonferroni-corrected z-test now has a FWER above 5% in scenarios 2 and 5, while the adaptive closed test now has a FWER of 5.1% in scenario 6.
However, the Holm adaptive test still achieves strong FWER control. For BAR, as before all the testing strategies control the FWER.

Table 16 gives the results for the type I error inflator, and Table 17 the results for BAR. Compared to assuming a known variance of σ² = 1, for all sets of parameter values both the FWER and disjunctive power increase slightly. For the type I error inflator, the same scenarios as before lead to an inflation of the FWER. In contrast, both the adaptive closed test and the Holm adaptive test achieve FWER control. For BAR, as before all the testing strategies control the FWER.

Table 18 gives the results for the type I error inflator, and Table 19 the results for BAR. Compared to assuming a known variance of σ² = 1, for all sets of parameter values both the FWER and disjunctive power increase slightly. For the type I error inflator, the same scenarios as before lead to an inflation of the FWER. Again, both the adaptive closed test and the Holm adaptive test achieve FWER control. However, for BAR, the adaptive closed test now has a FWER of 5.1% in scenario 2 (the FWER is controlled for all other scenarios and testing procedures).