Adaptive enrichment trials: What are the benefits?

Abstract When planning a Phase III clinical trial, suppose a certain subset of patients is expected to respond particularly well to the new treatment. Adaptive enrichment designs make use of interim data in selecting the target population for the remainder of the trial, either continuing with the full population or restricting recruitment to the subset of patients. We define a multiple testing procedure that maintains strong control of the familywise error rate, while allowing for the adaptive sampling procedure. We derive the Bayes optimal rule for deciding whether or not to restrict recruitment to the subset after the interim analysis and present an efficient algorithm to facilitate simulation‐based optimisation, enabling the construction of Bayes optimal rules in a wide variety of problem formulations. We compare adaptive enrichment designs with traditional nonadaptive designs in a broad range of examples and draw clear conclusions about the potential benefits of adaptive enrichment.


INTRODUCTION
Consider a Phase III trial in which it is believed a certain subset of patients will respond particularly well to the new treatment. We wish to test for a treatment effect in both the pre-identified subpopulation and the full population. Such multiple testing can be conducted using a closed testing procedure to control the familywise error rate (FWER). 1 In an adaptive enrichment design, if interim data suggest it is only the subpopulation that benefits from the new treatment, recruitment in the second half of the trial is restricted to the subpopulation. This increase in recruitment from the subpopulation is referred to as "enrichment" of the sampling rule.
We develop and assess designs which use a closed testing procedure with Simes' method 2 to test the intersection hypothesis and a weighted inverse normal combination test [3][4][5] to combine data from the two stages of the trial. We show that the resulting testing procedure controls the FWER, whatever rule is used to decide when enrichment should occur. This allows us to seek the enrichment rule which is optimal for a specified criterion. We shall follow the approach presented by Burnett, 6 defining a gain function that reflects the value of the outcome of the trial and a prior distribution for the treatment effects in the subpopulation and full population. The optimal decision at the interim analysis is that which maximises the expected gain with respect to the posterior distribution of the treatment effects, given current data. Since we use simulation in constructing the Bayes optimal decision rule for an adaptive design, our approach has the potential to be computationally expensive. We present an efficient algorithm for deriving this decision rule that significantly reduces the calculation required: using our methods, designs can be derived and tested in a matter of minutes on a laptop or PC.
In previous work on adaptive enrichment designs, Brannath et al 7 followed a Bayesian approach, assuming an uninformative prior for treatment effects. They determined the enrichment decision by comparing the posterior predictive probabilities of rejecting each hypothesis at the end of the trial with certain user-defined thresholds. Götte et al 8 considered families of enrichment rules defined in terms of linear combinations of the two treatment effect estimates or the conditional power to reject each hypothesis. They defined the "correct decision" at the interim analysis for given true values of the treatment effects and searched within their families of enrichment rules to maximise a weighted combination of the probabilities of a correct decision. Uozumi and Hamada 9 defined enrichment rules in terms of thresholds for the treatment effect estimates or predictive power for the two hypothesis tests and set these thresholds to optimize a utility function under specific values for the true treatment effects. Our methods are set in a more complete Bayesian decision theoretic framework. The gain function is chosen to summarize the benefits of the final decisions, reflecting the size of the population in which the new treatment is proven to be effective and the magnitude of the treatment effect in this population. The decision whether or not to enrich at the interim analysis is informed by both the posterior distribution of treatment effects and the interim estimates or P-values that will form part of the final hypothesis tests.
Ondra et al 10 developed Bayes optimal methods in a class of adaptive enrichment designs where FWER is controlled by a Bonferroni adjustment, assuming a 4-point discrete prior distribution for the two treatment effects. These simplifications allow the optimal enrichment decision rule to be found by maximising an integral, which is computed numerically. The application of Simes tests in our methods reduces conservatism in the testing procedure and the continuous prior distributions are better able to capture investigators' prior beliefs. Although our form of problem requires the use of simulation to find an optimal design, this approach has the advantage of extending very easily to other forms of gain function and multiple testing methods.
Through studying optimal designs, we are able to assess the potential benefits of adaptive enrichment. We have studied a variety of scenarios, drawing comparisons in each case with two nonadaptive designs: sampling the full population throughout the whole study or focusing on the subpopulation at the outset and only recruiting subpopulation patients. We see there are plausible prior distributions for which the adaptive enrichment design is superior to both forms of nonadaptive design. Furthermore, we recognize that investigators may be reluctant to restrict recruitment to the subpopulation from the outset and observe that in situations where this would have been the optimal policy, adaptive enrichment can give substantially higher expected gain than the nonadaptive, full population design.
Our studies also shed light on the underlying reasons for the effectiveness of adaptive designs. The good performance of adaptive designs in the special case of one-point prior distributions shows efficiency gains can follow from adapting to interim data and the likelihood of eventual rejection of each null hypothesis. With proper prior distributions, one might expect increased knowledge about the true treatment effects at the interim analysis to give adaptive designs a further advantage. However, we find such benefits to be modest: when the prior variance is high, considerable uncertainty about the true treatment effects remains; when the prior variance is low, information about the treatment effects at the interim analysis comes primarily from the prior, not the interim data.
The paper is structured as follows. We formulate the problem in Section 2 and we present methods for controlling FWER and combining data across stages in Section 3. We describe methods for optimising an adaptive design in Section 4, describe two forms of nonadaptive design in Section 5 and present examples in Section 6. We conclude with a discussion of the results obtained in our examples.
We suppose responses are normally distributed with a common variance σ² but note that, by large sample theory, distributions of treatment effect estimates will have the same form for a wide variety of response types. Let μA1 and μB1 be the expected responses for patients in S1 on Treatments A and B, respectively. Similarly, let μA2 and μB2 be the expected responses on Treatments A and B for patients in S2. Letting X ij denote the response of the ith patient in subpopulation S j on Treatment A and Y ij the response of the ith patient in S j on Treatment B, we have X ij ∼ N(μAj, σ²) and Y ij ∼ N(μBj, σ²). The treatment effects in subpopulations S1 and S2 are θ1 = μA1 − μB1 and θ2 = μA2 − μB2, respectively.
Suppose  1 represents a fraction of the full population. Then, the overall treatment effect in the full population is We shall write = ( 1 , 2 ), noting that determines the value of 3 . We assume the investigators are interested in testing H 01 : 1 ≤ 0 vs 1 > 0 and H 03 : 3 ≤ 0 vs 3 > 0. The hypothesis H 02 : 2 ≤ 0, is not to be tested (although one might require some evidence of a positive treatment effect in S 2 to support approval of the new treatment for the full population when H 03 is rejected). However, the approach we describe can also be applied when enrichment in either S 1 or S 2 is possible, or when there are more than two subpopulations; the key requirement is that the subpopulations and enrichment options are predefined.

Adaptive enrichment trial designs
If the new therapy is beneficial to all patients, we would hope to reject the null hypothesis H 03 and establish that there is an effect in the full patient population. However, if the benefit is restricted to patients in S1, it would be advantageous to focus on this subpopulation and increase the probability of rejecting H 01 . Adaptive enrichment designs aim to balance these two objectives by using interim data to decide whether or not to restrict enrolment in the remainder of the study to S1 and test only H 01 . We consider trial designs with a single interim analysis that takes place after a fraction τ of the planned sample size has been recruited and responses from these patients have been observed. Initially, patients are recruited from the full population. If, at the interim analysis, results on the new therapy are promising in both S1 and S2, recruitment continues across the full population. If, however, the new therapy only appears to benefit patients in S1, the remainder of the sample size is devoted to S1. Our objective is to optimize the rule for choosing between these two options in an adaptive enrichment design.
Let n be the total number of patients to be recruited. Assuming recruitment from S1 and S2 is in proportion to the size of these subpopulations, sample sizes at the interim analysis are τλn in S1 and τ(1 − λ)n in S2. When recruitment continues from the full population, an additional (1 − τ)λn patients are sampled from S1 and (1 − τ)(1 − λ)n from S2. If "enrichment" occurs and only patients from S1 are recruited after the interim analysis, there will be a further (1 − τ)n patients from S1. We assume that, within each stage of the trial, patients in each subpopulation are randomized equally between Treatments A and B.
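The sample-size bookkeeping above is simple but easy to get wrong; the following sketch (our own illustration, with `stage_sizes` a hypothetical helper, not a function from the paper) makes the arithmetic explicit for total size n, subpopulation fraction λ and interim fraction τ.

```python
# Illustrative helper: per-stage sample sizes in the adaptive enrichment design.
def stage_sizes(n, lam, tau):
    """Return stage-1 sizes, stage-2 sizes without enrichment, and with enrichment."""
    stage1 = {"S1": tau * lam * n, "S2": tau * (1 - lam) * n}
    stage2_full = {"S1": (1 - tau) * lam * n, "S2": (1 - tau) * (1 - lam) * n}
    stage2_enriched = {"S1": (1 - tau) * n, "S2": 0.0}  # all remaining patients to S1
    return stage1, stage2_full, stage2_enriched

s1, s2_full, s2_enr = stage_sizes(n=264, lam=0.5, tau=0.5)
# With n = 264 and lam = tau = 0.5: 66 per subpopulation in stage 1,
# and 132 further patients from S1 if enrichment occurs.
```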
In describing the distributions of parameter estimates, it is helpful to define the total information Ĩ = n∕(4σ²). (1) Note that a fixed sample size trial with n patients divided equally between Treatments A and B would produce an estimate θ̂3 with Var(θ̂3) = 4σ²∕n, so Ĩ = {Var(θ̂3)}⁻¹ represents the Fisher information for θ3 in this case.
Let m 11 = τλn∕2 and m 21 = τ(1 − λ)n∕2. Then, in the form of adaptive enrichment design we have described, the first stage yields treatment effect estimates θ̂ (1) 1 ∼ N(θ1, 2σ²∕m 11) and θ̂ (1) 2 ∼ N(θ2, 2σ²∕m 21), and we set θ̂ (1) 3 = λ θ̂ (1) 1 + (1 − λ) θ̂ (1) 2. The joint distribution of (θ̂ (1) 1 , θ̂ (1) 3 ) is bivariate normal with correlation √λ. Suppose that after the interim analysis the trial continues in the full population. Then, setting m 12 = (1 − τ)λn∕2 and m 22 = (1 − τ)(1 − λ)n∕2, the second stage data alone yield treatment effect estimates θ̂ (2) 1 ∼ N(θ1, 2σ²∕m 12) and θ̂ (2) 2 ∼ N(θ2, 2σ²∕m 22), with θ̂ (2) 3 = λ θ̂ (2) 1 + (1 − λ) θ̂ (2) 2. Again, the pair of estimates (θ̂ (2) 1 , θ̂ (2) 3 ) is bivariate normal with correlation √λ. Alternatively, suppose the trial is enriched and only subpopulation S1 is sampled in the second stage. Then, setting m 12 = (1 − τ)n∕2, the new data yield the estimate θ̂ (2) 1 ∼ N(θ1, 2σ²∕m 12) and no estimate of θ3 is available.
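The √λ correlation between θ̂1 and θ̂3 can be checked numerically. The following sketch (our own code, with illustrative values of n, σ and the effect sizes) simulates the stage-1 estimates and compares the empirical correlation with √λ.

```python
import numpy as np

# Simulate stage-1 estimates; theta3_hat = lam*theta1_hat + (1-lam)*theta2_hat
# should have correlation sqrt(lam) with theta1_hat.
rng = np.random.default_rng(0)
n, lam, tau, sigma = 264, 0.5, 0.5, 25.0
m11, m21 = tau * lam * n / 2, tau * (1 - lam) * n / 2   # per-arm stage-1 sizes

reps = 200_000
t1 = rng.normal(10.0, np.sqrt(2 * sigma**2 / m11), reps)  # Var = 4*sigma^2/(tau*lam*n)
t2 = rng.normal(2.0, np.sqrt(2 * sigma**2 / m21), reps)
t3 = lam * t1 + (1 - lam) * t2

corr = np.corrcoef(t1, t3)[0, 1]
print(round(corr, 3))  # close to sqrt(0.5) ≈ 0.707
```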

Closed testing procedures
Control of the type I error rate in a confirmatory clinical trial is paramount 11 and, with two null hypotheses under consideration, the testing procedure should provide strong control of the FWER at the prespecified level α. 1 Thus, we require P θ (Reject at least one true null hypothesis) ≤ α for all θ.
We shall follow the general approach presented by Bretz et al 12 and Schmidli et al, 13 applying a closed testing procedure to the hypotheses H 01 , H 03 , and their intersection H 0,13 . For an explanation of why such a procedure protects the FWER and why all procedures that provide strong control of FWER can be interpreted as closed testing procedures, see Appendix A.
We refer to the periods of an adaptive enrichment design before and after the interim analysis as stages 1 and 2. In our closed testing procedure, we need a method for combining test statistics for hypotheses H 01 and H 03 to test the intersection hypothesis H 0,13 , and a method to combine data across stages, bearing in mind that the decision about which subpopulations to recruit from in stage 2 depends on the stage 1 data. We describe these methods in the following sections.
If enrichment does not take place and stage 2 continues with recruitment from the full population, we define P (2) 1 and P (2) 3 to be P-values for testing H 01 and H 03 based on data from stage 2 patients alone. Then, just as for stage 1 data, we construct the Simes P-value for testing the intersection hypothesis H 0,13 . If enrichment does take place, only patients from S1 are observed in stage 2 and we define the P-value P (2) 1 for H 01 based on these observations. We cannot define a P-value P (2) 3 but this is not a problem as we no longer plan to test H 03 . In this case we set P (2) 13 = P (2) 1 , noting that H 0,13 implies θ1 ≤ 0 and hence P (2) 13 is Unif(0, 1), or stochastically larger than this, under H 0,13 .

The weighted inverse normal combination test
In constructing level α tests of H 01 , H 03 , and H 0,13 , we need to combine P-values from the two stages. In each case, we do this using a weighted inverse normal combination test. [3][4][5] Consider first the level α test of H 01 . The stage 1 data give Z (1) 1 = θ̂ (1) 1 √(τλĨ) and the associated P-value is P (1) 1 = 1 − Φ(Z (1) 1 ), where Φ denotes the cumulative distribution function of a standard normal random variable. If the trial recruits from the full population in stage 2, we have Z (2) 1 = θ̂ (2) 1 √{(1 − τ)λĨ} while, if enrichment occurs, we have Z (2) 1 = θ̂ (2) 1 √{(1 − τ)Ĩ}, and in either case the associated P-value is P (2) 1 = 1 − Φ(Z (2) 1 ). Suppose θ1 = 0. Then, Z (1) 1 ∼ N(0, 1) and P (1) 1 ∼ Unif(0, 1). Conditional on the first stage data, Z (2) 1 ∼ N(0, 1) and P (2) 1 ∼ Unif(0, 1). Since the conditional distribution of Z (2) 1 does not depend on the stage 1 data, we conclude that Z (1) 1 and Z (2) 1 are independent N(0, 1) random variables. Using pre-specified weights w 1 and w 2 for which w 1 ² + w 2 ² = 1, we define the combination test statistic Z (c) 1 = w 1 Z (1) 1 + w 2 Z (2) 1 and note that Z (c) 1 ∼ N(0, 1) when θ1 = 0. Suppose now that θ1 < 0. We can write Z (c) 1 = ε (c) 1 − γ, where ε (c) 1 ∼ N(0, 1) and γ is a positive random variable, so Z (c) 1 is stochastically smaller than a N(0, 1) random variable. Hence, the test that rejects H 01 when Z (c) 1 > Φ⁻¹(1 − α) has type I error rate less than or equal to α whenever θ1 ≤ 0, as required.
We construct a level α test of H 03 in a similar way to that of H 01 . We have Z (1) 3 = θ̂ (1) 3 √(τĨ) from stage 1 data and, if enrichment does not occur, Z (2) 3 = θ̂ (2) 3 √{(1 − τ)Ĩ} from stage 2 data. In the case of no enrichment, we create the combination test statistic Z (c) 3 = w 1 Z (1) 3 + w 2 Z (2) 3 . The proof that this test controls the type I error rate follows the same lines as that for the test of H 01 but, since we do not test H 03 at all when enrichment occurs, this test is conservative. The level α test of the intersection hypothesis H 0,13 is constructed from the P-values P (1) 13 and P (2) 13 as defined in Equations (2), (3) and (4). Under H 0,13 , the positive correlation between θ̂ (1) 1 and θ̂ (1) 3 ensures that the Simes P-value P (1) 13 is stochastically larger than a Unif(0, 1) random variable. Hence Z (1) 13 = Φ⁻¹(1 − P (1) 13 ) is stochastically smaller than a N(0, 1) random variable and we can write Z (1) 13 = ε (1) 13 − γ 1 , (5) where ε (1) 13 ∼ N(0, 1) and γ 1 is a positive random variable, not necessarily independent of ε (1) 13 . If no enrichment occurs, by similar reasoning, the conditional distribution under H 0,13 of Z (2) 13 = Φ⁻¹(1 − P (2) 13 ), given stage 1 data, is stochastically smaller than a N(0, 1) random variable. If enrichment does occur, Z (2) 13 = Z (2) 1 and has conditional distribution N(θ1 √{(1 − τ)Ĩ}, 1) given stage 1 data. It follows that, under H 0,13 , we can write Z (2) 13 = ε (2) 13 − γ 2 , (6) where ε (2) 13 ∼ N(0, 1) is independent of ε (1) 13 and γ 2 is a positive random variable that may depend on ε (1) 13 and ε (2) 13 . It follows from Equations (5) and (6) that, under H 0,13 , Z (c) 13 = w 1 Z (1) 13 + w 2 Z (2) 13 is stochastically smaller than a N(0, 1) variable. Hence, the test that rejects H 0,13 if Z (c) 13 > Φ⁻¹(1 − α) has type I error rate less than or equal to α whenever θ1 ≤ 0 and θ3 ≤ 0.

Summary of the overall testing procedure
Let S(P 1 , P 2 ) = min{2 min(P 1 , P 2 ), max(P 1 , P 2 )} be the function that converts P 1 and P 2 into a Simes P-value, and let W(P (1) , P (2) ) = 1 − Φ{w 1 Φ⁻¹(1 − P (1) ) + w 2 Φ⁻¹(1 − P (2) )} (7) be the function that gives the P-value when a weighted inverse normal combination test with weights w 1 and w 2 is applied to stage 1 and 2 P-values P (1) and P (2) . With this notation, Table 1 presents a summary of the closed testing procedure described above.
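Both building blocks are short enough to transcribe directly. The sketch below (our own code) implements the Simes combination S and the weighted inverse normal combination W described above, with weights assumed to satisfy w1² + w2² = 1.

```python
from statistics import NormalDist

_N = NormalDist()

def S(p1, p2):
    """Simes P-value for the intersection of two hypotheses."""
    return min(2 * min(p1, p2), max(p1, p2))

def W(p_stage1, p_stage2, w1, w2):
    """Weighted inverse normal combination of stage 1 and stage 2 P-values."""
    z = w1 * _N.inv_cdf(1 - p_stage1) + w2 * _N.inv_cdf(1 - p_stage2)
    return 1 - _N.cdf(z)

# Example: two moderately small stage-wise P-values combine into a smaller one.
print(S(0.03, 0.04), W(0.025, 0.025, 2**-0.5, 2**-0.5))
```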
In a trial where enrichment does not occur and patients are recruited from the full population in stage 2, we reject H 01 overall if P (c) 1 ≤ α and P (c) 13 ≤ α, and we reject H 03 overall if P (c) 3 ≤ α and P (c) 13 ≤ α. If enrichment occurs, H 01 is rejected overall if P (c) 1 ≤ α and P (c) 13 ≤ α, but it is not possible to test H 03 as there is no P (2) 3 to use in the combination test of H 03 ; this is in keeping with the decision to enrich, which implies it is no longer desired to test H 03 .
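The overall rejection logic can be written out in a few lines. This is a sketch of the decision step only (our own code, taking the combination-test P-values P(c)1, P(c)3, P(c)13 as inputs), not the full procedure of Table 1.

```python
# Sketch of the overall closed-testing decisions. p3_c is None after
# enrichment, since no stage-2 P-value for H03 exists in that case.
def overall_decisions(p1_c, p13_c, alpha, p3_c=None, enriched=False):
    reject_h01 = p1_c <= alpha and p13_c <= alpha
    reject_h03 = (not enriched) and p3_c is not None \
        and p3_c <= alpha and p13_c <= alpha
    return reject_h01, reject_h03

# Example: both hypotheses rejected at alpha = 0.025 without enrichment.
print(overall_decisions(0.01, 0.02, 0.025, p3_c=0.015))
```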

Bayesian decision framework
An enrichment design, as described in Section 2.2, that applies the closed testing procedure presented in Section 3 will protect the FWER regardless of the decision rule that determines when to enrich in stage 2. This gives us the opportunity to apply Bayesian decision theory 17 to optimize the enrichment decision rule for our chosen criterion. This decision theoretic approach requires the specification of a prior distribution for and a gain, or utility, function that assigns a value to the final outcome of the study.
The decision rule. We denote the sufficient statistic for θ = (θ1, θ2) based on stage 1 data by X 1 = (θ̂ (1) 1 , θ̂ (1) 2 ). Note that (θ1, θ2) determines (θ1, θ3) and vice versa, so X 1 is also the sufficient statistic for (θ1, θ3). We shall consider decision rules that are functions of X 1 . The decision under rule d is specified through the function d(X 1 ) taking values in {1, 2}, with d(X 1 ) = 1 indicating that enrichment occurs and d(X 1 ) = 2 that recruitment continues from the full population. The form of the sufficient statistic X 2 for θ based on stage 2 data depends on which decision is taken. If d(X 1 ) = 1, enrichment occurs and X 2 = θ̂ (2) 1 , while if d(X 1 ) = 2 enrichment does not occur and X 2 = (θ̂ (2) 1 , θ̂ (2) 2 ). In either case we write X = (X 1 , d(X 1 ), X 2 ) to summarize the full set of data at the end of the study and the decision taken at the interim analysis.
The prior distribution for θ. We assume a continuous prior distribution for θ = (θ1, θ2) is specified and we denote its probability density function by π(θ).
The gain function. The gain function G(θ, X) denotes the value assigned to the outcome of the study when θ is the parameter vector and we observe X = (X 1 , d(X 1 ), X 2 ). Note that we can deduce from X which of the hypotheses H 01 and H 03 are rejected in the final analysis.
Let  1 be the indicator variable of the event that H 01 is rejected but H 03 is not rejected, and let  3 be the indicator variable of the event that H 03 is rejected. Both  1 and  3 are functions of X. In this paper we shall consider the gain function Here, the gain is deemed to be proportional to the size of the population for which a treatment effect is found and also to the average treatment effect for patients in that population.
Other forms of gain function are possible: the key feature is that they are constructed from the possible outcomes of the trial. A general form of gain function should capture the importance of each of these possible outcomes; for example, if we define g 1 (θ, X) to represent the benefit of rejecting H 01 and g 3 (θ, X) to represent the benefit of rejecting H 03 , then the gain function will be G(θ, X) = g 1 (θ, X) I 1 + g 3 (θ, X) I 3 . The choice of g 1 (θ, X) and g 3 (θ, X) may reflect both the treatment effect, as seen in Equation (8), and the estimates of θ1 and θ3 which can be constructed from X. In our formulation of the design question, the total sample size is fixed, so we have not included a cost of treating patients in the study in the overall gain function: such a cost would be required if we were to include the option of stopping for futility at the interim analysis. One could also consider adding other important outcomes of the trial, such as the safety profile of the treatment. The application of the methods that follow does not depend strongly on the choice of gain function, although that choice will influence which design is optimal.
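A minimal concrete instance of such a gain function, consistent with the verbal description above (population size reached, times the average treatment effect in that population), can be sketched as follows; this is our own illustration, not a transcription of the paper's Equation (8).

```python
# Sketch: gain = (size of population with a proven effect) * (average effect there).
# The full population has size 1 and subpopulation S1 has size lam.
def gain(theta1, theta3, lam, reject_h01, reject_h03):
    if reject_h03:
        return theta3          # effect claimed in the full population
    if reject_h01:
        return lam * theta1    # effect claimed in S1 only
    return 0.0
```

With λ = 0.5, rejecting only H 01 when θ1 = 10 yields gain 5, while rejecting H 03 when θ3 = 6 yields gain 6, so the full-population claim is preferred here despite the smaller per-patient effect.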

Computing the Bayes optimal design
With the prior distribution π and gain function G specified, we wish to find the decision rule d that maximises the Bayes expected gain of the trial E{G(θ, X)}, where the expectation is over both the prior distribution for θ and the distribution of X given θ. We denote the conditional density function of X 1 given θ by f(x 1 | θ), the density of the marginal distribution of X 1 by f(x 1 ), and the conditional density of X 2 given θ and decision d(x 1 ) by f(x 2 | θ, d(x 1 )). Then the expected gain when applying decision rule d is E{G(θ, X)} = ∫ f(x 1 ) { ∫ π(θ | x 1 ) ∫ G(θ, x) f(x 2 | θ, d(x 1 )) dx 2 dθ } dx 1 . (9) It is evident from (9) that the optimal decision rule can be found by choosing d(x 1 ) to maximize ∫ π(θ | x 1 ) ∫ G(θ, x) f(x 2 | θ, d(x 1 )) dx 2 dθ (10) for each x 1 . That is, we choose the enrichment decision that maximizes the conditional expected gain given the stage 1 data under the posterior distribution of θ at the interim analysis. Given observed stage 1 data x 1 = (θ̂ (1) 1 , θ̂ (1) 2 ), we need to compare values of the integral (10) in the two cases d(x 1 ) = 1 (enrichment) and d(x 1 ) = 2 (no enrichment). Since this integral is not analytically tractable, we evaluate it by Monte Carlo simulation. To do this, we draw a sample {θ i = (θ i,1 , θ i,2 ), i = 1, … , M} from the posterior distribution π(θ | x 1 ) and find the conditional expected gain under each θ i for the two options, "enrich" and "do not enrich." We take the average gain over this sample of θ i values as our estimate of the conditional expected gain for each option. We conclude that the decision d(x 1 ) giving the larger of the two values for the conditional expected gain is the Bayes optimal decision when X 1 = x 1 . In assessing the decision to enrich, d(x 1 ) = 1, we apply the definitions of Section 3 to find the critical value c(x 1 ) such that θ̂ (2) 1 ≥ c(x 1 ) implies P (c) 1 ≤ α and P (c) 13 ≤ α, so H 01 is rejected in the closed testing procedure. We compute P(θ̂ (2) 1 ≥ c(x 1 ) | θ = θ i ) for i = 1, … , M and combine the results to obtain the estimate of the conditional expected gain. If d(x 1 ) = 2 and the trial continues without enrichment, the possibilities in stage 2 are more complex.
In this case, for each i = 1, … , M we continue to simulate the remainder of the trial by generating stage 2 data (θ̂ (2) 1 , θ̂ (2) 2 ) under θ = θ i = (θ i,1 , θ i,2 ). Combining these results gives the estimate of the conditional expected gain. The value of M used in these simulations should be chosen to give the desired level of accuracy. We have found M = 10⁵ or 10⁶ to give sufficient accuracy in the examples we have studied.
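The Monte Carlo step for the "enrich" option is simple enough to sketch. In the code below (our own illustration), `post_theta1` stands for M posterior draws of θ1 given x1, `crit` stands in for the critical value c(x1) of Section 3, `sd2` is the SD of the stage-2 estimate, and the gain λθ1 corresponds to the case where only H 01 can be rejected.

```python
from statistics import NormalDist, fmean

def expected_gain_enrich(post_theta1, crit, sd2, lam):
    """Average lam * theta1 * P(theta1_hat^(2) >= crit | theta1) over posterior draws."""
    N = NormalDist()
    return fmean(lam * t * (1 - N.cdf((crit - t) / sd2)) for t in post_theta1)
```

Averaging the analytically available rejection probability, rather than simulating a stage-2 estimate for each draw, removes one layer of Monte Carlo noise.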

Determining the decision rule and decision boundary
In order to find the operating characteristics of a proposed adaptive enrichment design we must be able to repeatedly simulate the design in full. This requires repeated application of the interim decision rule that specifies the optimal design for a given prior π and gain function G: thus we need to know the optimal decision for all possible values of x 1 = (θ̂ (1) 1 , θ̂ (1) 2 ). We present an algorithm that enables the computation of the optimal decision rule over a large square region, A, such that P(X 1 ∈ A) is very close to 1. The algorithm divides this region into an array of much smaller squares and determines the optimal decision for values of x 1 in each small square. With simple extrapolation beyond the boundaries of A, this process divides the plane into two regions, A E where the optimal decision is to enrich, and A C where it is optimal to continue recruitment in the full population.
Experience shows that the two regions A E and A C are quite regular in shape, and this fact allows us to reduce the computation needed to find the optimal decision rule. We first divide A into four subsquares and determine the optimal decisions at the vertices of these squares. Then, if the same decision is optimal at all four vertices, we record this as the optimal decision for all points in that square. If, however, both decisions are optimal for at least one vertex, we subdivide this square into four smaller squares. In the next iterative step, we consider the set of squares of the smallest size and for each of these we either record an optimal decision for the whole square or subdivide the square into four smaller ones. We continue this iterative process until we reach squares of the desired size. Further details of this method and a discussion of its accuracy are given in Appendix B. The results of these calculations are twofold. First, the list of optimal decisions for each small square provides the information needed to implement the optimal adaptive decision rule. Second, the results can be presented graphically to help visualize the optimal decision rule.
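The subdivision idea can be sketched recursively. In this illustration (our own code, not the paper's implementation), `decide(x, y)` is a placeholder for the Bayes-optimal decision, returning 1 ("enrich") or 2 ("continue"), and squares whose four corners agree are labelled whole while the others are split down to a minimum size.

```python
# Recursive square subdivision: label a square if its corners agree,
# otherwise split it into four subsquares until min_size is reached.
def classify(decide, x0, y0, size, min_size, out):
    corners = {decide(x0, y0), decide(x0 + size, y0),
               decide(x0, y0 + size), decide(x0 + size, y0 + size)}
    if len(corners) == 1 or size <= min_size:
        out.append((x0, y0, size, corners.pop()))
        return
    half = size / 2
    for dx in (0.0, half):
        for dy in (0.0, half):
            classify(decide, x0 + dx, y0 + dy, half, min_size, out)

squares = []
classify(lambda x, y: 1 if x + y < 8 else 2, 0.0, 0.0, 8.0, 1.0, squares)
# The labelled squares tile the original region; effort concentrates
# near the decision boundary x + y = 8.
```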

Assessing the performance of an optimized trial design
Suppose the decision rule of an optimized adaptive enrichment design is defined by regions A E and A C as described above.
We assess the overall performance of this design by simulation. For each replicate i = 1, … , N, we generate a parameter vector θ i = (θ i,1 , θ i,2 ) from the prior, then simulate stage 1 data x i,1 = (θ̂ (1) 1 , θ̂ (1) 2 ) assuming θ = θ i . We determine whether x i,1 is in A E or A C , set d(x i,1 ) = 1 or 2 accordingly, and apply this decision, still assuming θ = θ i , as we generate the stage 2 data. Finally, we determine which hypotheses are rejected and evaluate the gain function for these outcomes when θ = θ i . Averaging over the N replicates gives the estimate of E{G(θ, X)}. The same set of simulated data can be used to estimate other properties of the design such as the probabilities of rejecting each null hypothesis. In our simulations we have used N = 10⁶, so sampling error for the estimates reported is negligible. One might ask whether it would be helpful to generate multiple replicates of the stage 2 data for each θ i and x i,1 . However, the distribution of θ i and x i,1 accounts for much of the variability of G(θ, X) and it is more efficient to use the available computational effort to increase the number of replicates, N, of the first stage data. Of course, this approach relies on our having carried out initial work to find the regions A E and A C that define the optimal decision rule, and in doing this we will have generated multiple samples of stage 2 data conditional on particular values of X 1 .

TWO NONADAPTIVE DESIGNS
There are two further options that should be considered when an adaptive enrichment design is envisaged. The first is a design in which patients are recruited from the full population throughout the trial, but both null hypotheses H 01 and H 03 are tested at the end. We shall refer to this as the Fixed Full population (FF) design. The other possibility is a Fixed Subpopulation (FS) design, in which subjects are only recruited from the subpopulation and only the hypothesis H 01 is tested.
The Fixed Full population design. For comparability with other designs, we assume the same total sample size, n, as in Section 2.2. Thus, λn patients are recruited from S1 and (1 − λ)n from S2. With Ĩ as defined in (1), the data provide estimates θ̂1 ∼ N(θ1, {λĨ}⁻¹) and θ̂3 ∼ N(θ3, Ĩ⁻¹), and the joint distribution of (θ̂1, θ̂3) is bivariate normal with correlation √λ. The P-values for testing H 01 and H 03 are P 1 = 1 − Φ(θ̂1 √(λĨ)) and P 3 = 1 − Φ(θ̂3 √Ĩ), respectively, and Simes' method gives the P-value P 13 = min{2 min(P 1 , P 3 ), max(P 1 , P 3 )} for the intersection hypothesis H 0,13 . Applying the closed testing procedure, we reject H 01 overall if P 1 ≤ α and P 13 ≤ α, and we reject H 03 overall if P 3 ≤ α and P 13 ≤ α.
There are reasons why the FF design may be more efficient than the optimal adaptive design if the prior π(θ) is concentrated on values of θ under which enrichment is unlikely to occur. Suppose an adaptive design is conducted and enrichment does not occur. With suitable weights in the combination rule (7), the adaptive design's P-values P (c) 1 and P (c) 3 , as shown in Table 1, are equal to the P 1 and P 3 obtained when the same data are observed in the FF design. However, P (c) 13 = W(P (1) 13 , P (2) 13 ) differs from the P 13 arising from the same data in the FF design. Since P 13 in the FF design is based on the sufficient statistics for θ1 and θ3 in the full data set, it provides a more powerful test of H 0,13 than the adaptive design's P (c) 13 . The requirement to use P (c) 13 rather than P 13 to test H 0,13 is the price we pay for the adaptive design's flexibility to enrich on other occasions: if such occasions are not particularly likely under the prior π(θ), it is plausible that the FF design will be superior.
The Fixed Subpopulation design. In the FS design, all n subjects are recruited from S1. These provide the estimate θ̂1 ∼ N(θ1, Ĩ⁻¹) and the P-value P 1 = 1 − Φ(θ̂1 √Ĩ), and H 01 is rejected if P 1 ≤ α. In this design H 03 is not tested. We can expect the FS design to perform well when the prior π(θ) is such that the optimal adaptive design is highly likely to enrich. Then, the FS design has the benefit of a larger sample size from S1 and, hence, a more accurate estimate θ̂1. Furthermore, the FS design only tests H 01 and so does not have to make a multiplicity adjustment for testing two hypotheses.
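Because the FS design involves a single test, its Bayes expected gain is easy to estimate directly. The sketch below (our own code, with illustrative prior parameters rather than the paper's Table 3 settings) uses the gain λθ1 when H 01 is rejected.

```python
import numpy as np
from statistics import NormalDist

# Monte Carlo expected gain of the FS design under a normal prior for theta1.
rng = np.random.default_rng(1)
n, sigma, lam, alpha = 264, 25.0, 0.5, 0.025
sd_hat = np.sqrt(4 * sigma**2 / n)              # SD of theta1_hat with all n in S1
crit = NormalDist().inv_cdf(1 - alpha) * sd_hat  # reject H01 if theta1_hat >= crit

theta1 = rng.normal(10.0, 5.0, 1_000_000)        # prior draws (illustrative)
theta1_hat = rng.normal(theta1, sd_hat)          # simulated estimates
m = float(np.where(theta1_hat >= crit, lam * theta1, 0.0).mean())
print(round(m, 2))
```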

One-point prior distributions
We consider a Phase III clinical trial as described in Section 2.1 where the subpopulations S1 and S2 are of equal size, so λ = 0.5. We set the FWER to be α = 0.025 and suppose the total sample size n would provide power 0.9 to detect a treatment effect of size 10 when testing only the hypothesis H 03 in a nonadaptive design. This leads to the total information Ĩ = {(Φ⁻¹(0.9) + Φ⁻¹(0.975))∕10}² = 0.105, which is, for example, the information provided by a total sample size n = 264 when patient responses have standard deviation σ = 25. In adaptive enrichment designs we suppose the interim analysis occurs after half the total sample has been observed, thus τ = 0.5. Then, with λ = 0.5, τ = 0.5 and Ĩ = 0.105, the interim estimates θ̂ (1) 1 and θ̂ (1) 2 have SD 6.15. In order to gain insight into how adaptive designs function and what they may achieve, we first consider cases where the prior distribution for θ places probability mass 1 at a single point, θ = θ0 = (θ0,1, θ0,2). For given θ0, we derived the decision rule for the adaptive enrichment (AE) design that maximises the expected gain, using the gain function G(θ, X) specified in (8). For comparison, we also computed properties under θ = θ0 of the FF design, which recruits from the full population throughout the trial, and the FS design which only recruits from the subpopulation. Results presented in Table 2 for selected values of θ0 show each type of design, FF, FS, and AE, to be optimal for certain values of θ0.
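The design arithmetic quoted above can be reproduced in a few lines (our own check of the stated values):

```python
from math import sqrt
from statistics import NormalDist

N = NormalDist()
I_total = ((N.inv_cdf(0.9) + N.inv_cdf(0.975)) / 10) ** 2   # total information for power 0.9
sigma = 25.0
n_exact = 4 * sigma**2 * I_total    # sample size giving information I_total; round up to 264
n, lam, tau = 264, 0.5, 0.5
sd_interim = sqrt(4 * sigma**2 / (tau * lam * n))   # SD of interim estimates in each subpopulation
print(round(I_total, 3), round(n_exact, 1), round(sd_interim, 2))
```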
We carried out further calculations on a grid of values of θ0 to find the regions where each type of design is optimal. These regions are shown in Figure 1.
We note that the FF design is optimal when θ0,3 = 0.5(θ0,1 + θ0,2) is large or θ0,1 is only a little larger than θ0,2. The FS design is optimal when θ0,1 is substantially larger than θ0,2 and θ0,2 is small. This leaves a region of θ0 values where the AE design is optimal, offering a modest increase in expected gain over both fixed designs. The advantage of the AE design over the FF design is largest in cases such as θ0 = (10, 2) and θ0 = (12, 2), where θ0,2 is small and the AE design has a high probability of enrichment and rejection of H 01 only. Although the FS design has even higher expected gain in these cases, investigators may be reluctant to make such an early decision to ignore subpopulation S2 completely, in which case the key comparison is between AE and FF designs.

TABLE 2 Properties of fixed subpopulation (FS), fixed full population (FF), and optimal adaptive enrichment (AE) designs when θ = θ0 = (θ0,1, θ0,2). Here P(I 1) is the probability that only H 01 is rejected and P(I 3) the probability that H 03 is rejected. The AE design is optimized for the prior distribution with probability 1 at the single point θ = θ0. In each case, the design with the highest expected gain is highlighted.

In extreme cases such as θ0 = (10, 10) where both θ0,1 and θ0,2 are high, there is a high probability that the AE design does not enrich and so has the same final dataset as the FF design. As discussed in Section 5, the AE design uses a different form of P (c) 13 and this leads to less efficient use of the final data when enrichment does not occur and a lower expected gain than for the FF design.

Trial design P( 1 ) P( 3 ) P(Enrich) E{G( , X)}
Since the AE design is optimized with knowledge of the value of θ₀, its advantage when it is superior to both fixed designs does not stem from having improved estimates of the true treatment effects at the interim analysis. Rather, the decision to enrich or not is based on the likelihood that current data, summarized as (θ̂₁^(1), θ̂₂^(1)), will lead to rejection of each null hypothesis. One may expect an adaptive design to have more to offer when the prior distribution for θ is more dispersed, since then it can also exploit the information about θ that becomes available at the interim analysis. We shall assess the performance of designs under dispersed prior distributions for θ in the next section.
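The structure of this interim decision can be illustrated in code. The sketch below is a simplified stand-in, not the procedure defined in this paper: it assumes an independent normal prior on each treatment effect, an unadjusted z-critical value in place of the closed testing procedure, and a crude version of the gain function; the function names and all numerical choices are ours.

```python
import random
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # illustrative critical value (no multiplicity adjustment)
S1 = 6.15                             # SD of each interim estimate (as in Section 6.1)
T, LAM = 0.5, 0.5                     # interim fraction and subpopulation fraction
W1, W2 = T ** 0.5, (1 - T) ** 0.5     # inverse normal combination weights

def gain(th1, th2, rej1, rej3):
    # Stand-in for G(theta, X) in (8): credit theta_3 if the treatment is
    # licensed for the full population, lam * theta_1 if for P1 only, else 0.
    th3 = LAM * th1 + (1 - LAM) * th2
    if rej3:
        return th3
    return LAM * th1 if rej1 else 0.0

def expected_gain(enrich, post, z1_obs, z2_obs, rng):
    s2 = S1 / 2 ** 0.5 if enrich else S1   # enrichment doubles stage-2 data on P1
    total = 0.0
    for th1, th2 in post:
        z1 = W1 * z1_obs + W2 * rng.gauss(th1, s2) / s2
        if enrich:
            rej1, rej3 = z1 > Z_CRIT, False   # no stage-2 data on P2 in this sketch
        else:
            z2 = W1 * z2_obs + W2 * rng.gauss(th2, s2) / s2
            rej1 = z1 > Z_CRIT
            rej3 = (z1 + z2) / 2 ** 0.5 > Z_CRIT
        total += gain(th1, th2, rej1, rej3)
    return total / len(post)

def decide_enrich(th1_hat, th2_hat, mu=(10.0, 0.0), tau=5.0, n_draws=4000, seed=1):
    """Bayes rule sketch: enrich iff posterior expected gain is higher when enriching.

    Assumes an independent N(mu_i, tau^2) prior on each theta_i, so each
    posterior is normal with the usual conjugate update given the interim
    estimate (SD S1).  All choices here are illustrative stand-ins.
    """
    rng = random.Random(seed)
    post = []
    for _ in range(n_draws):
        draw = []
        for m, x in zip(mu, (th1_hat, th2_hat)):
            v = 1.0 / (1.0 / tau**2 + 1.0 / S1**2)        # posterior variance
            draw.append(rng.gauss(v * (m / tau**2 + x / S1**2), v ** 0.5))
        post.append(tuple(draw))
    z1, z2 = th1_hat / S1, th2_hat / S1
    return (expected_gain(True, post, z1, z2, rng)
            > expected_gain(False, post, z1, z2, rng))
```

With a strongly positive interim estimate in 𝒫₁ and a negative one in 𝒫₂, this rule enriches; with both estimates high it continues with the full population.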

Proper prior distributions for θ
In practice, one expects there to be considerable uncertainty about the true treatment effect. We capture this uncertainty in a bivariate normal prior distribution for θ = (θ₁, θ₂), with means μ₁ and μ₂, variances σ₁² and σ₂², and correlation ρ. Figure 2 shows the enrichment decision rule for the Bayes optimal adaptive enrichment trial when μ₁ = 12, μ₂ = 2, σ₁² = σ₂² = 25 and ρ = 0.75. Enrichment does not occur when both θ̂₁^(1) and θ̂₂^(1) are low, since rejection of H₀₁ is then also unlikely: one could add a rule to stop for futility in such cases. When θ̂₁^(1) is high, so that rejection of H₀₁ is very likely, the trial is not enriched, even for lower values of θ̂₂^(1), as long as it is feasible that H₀₃ will also be rejected. Table 3 shows properties of the Bayes optimal AE design, along with properties of the nonadaptive FF and FS designs, for prior distributions centred at the values of θ₀ considered in Table 2 but with σ₁² = σ₂² = 25 and ρ = 0.75. In contrast with the results of Table 2, the AE design has higher expected gain than the FS design in all these examples with a dispersed prior.
The AE design has higher expected gain than the FF design in six of the ten examples, but the margin of superiority is not great. Thus, there is not much evidence that the enrichment design profits from information about θ at the interim analysis. The explanation for this is that, in the examples of Table 3, the posterior distribution of θ after seeing the interim data is still widely dispersed, with the SDs for θ₁ and θ₂ equal to 3.59. This is not just a feature of our particular examples. Suppose a study's total sample size is chosen so that a final test of H₀₃: θ₃ ≤ 0 with type I error rate 0.025 has power 0.9 when θ₃ = δ. With no enrichment, the SD of the final θ̂₃ is 0.31δ. If there are two equally sized subpopulations, the interim estimates of θ₁ and θ₂ based on half of the total data have SD 0.62δ. The posterior variance of θ₁ and θ₂ at the interim analysis depends on the prior variances of θ₁ and θ₂ and, to a small degree, on the prior correlation. If, as in the examples of Table 3, the prior has Var(θ₁) = Var(θ₂) = (δ/2)², the posterior SDs of θ₁ and θ₂ at the interim analysis will be around 0.36δ and a credible interval for θ₁ or θ₂ could easily contain both 0 and δ. On the other hand, the lower prior variances Var(θ₁) = Var(θ₂) = (δ/4)² lead to posterior SDs around 0.23δ, only slightly lower than the prior SDs of 0.25δ. Thus, in cases where the prior variance is high, considerable uncertainty about θ₁ and θ₂ remains at the interim analysis, while if the prior variance is low, the interim data have little impact on the posterior distribution of θ₁ and θ₂.

FIGURE 2 An example of a Bayes optimal decision rule for an adaptive enrichment trial.

Table 4 presents results for a further selection of prior distributions for θ. The examples show that the prior correlation, ρ, has a small effect on expected gain but very little effect on the relative performance of different designs.
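The posterior SDs quoted here follow from the standard normal-normal update. A quick check, taking δ = 1, interim-estimate SD 0.62 for each component, and, as an assumption matching Table 3, prior correlation ρ = 0.75:

```python
def posterior_sd(prior_sd, data_sd=0.62, rho=0.75):
    """Posterior SD of theta_1 (= theta_2 by symmetry) when a bivariate normal
    prior with common SD `prior_sd` and correlation `rho` is combined with
    independent normal estimates of theta_1 and theta_2 (delta = 1)."""
    s2 = prior_sd ** 2
    a = 1.0 / ((1.0 - rho**2) * s2)      # diagonal of the prior precision matrix
    b = -rho / ((1.0 - rho**2) * s2)     # off-diagonal of the prior precision
    a += 1.0 / data_sd**2                # add the (diagonal) data precision
    det = a * a - b * b                  # invert the 2x2 posterior precision
    return (a / det) ** 0.5

print(round(posterior_sd(0.50), 2), round(posterior_sd(0.25), 2))              # 0.36 0.22
print(round(posterior_sd(0.50, rho=0.0), 2), round(posterior_sd(0.25, rho=0.0), 2))  # 0.39 0.23
```

With ρ = 0.75 this gives posterior SDs of about 0.36δ and 0.22δ for prior SDs δ/2 and δ/4; setting ρ = 0 moves these only slightly, to about 0.39δ and 0.23δ, consistent with the small effect of prior correlation noted above.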
In cases with (μ₁, μ₂) equal to (10, 2) or (12, 2) and low prior variance, the FS design is best, but it is substantially inferior to the FF and AE designs in other situations. We conclude that the FS design option should only be considered if there is a strong prior belief that the new treatment will offer little or no benefit to subpopulation 𝒫₂.
For the cases in Table 4, the AE design has higher expected gain than the FF design (with the exception of a couple of cases where the two designs have almost equal expected gain). However, we have failed to find an example where the AE design is vastly superior to both the FS and FF designs: the example in Table 3 with (μ₁, μ₂) = (14, 2) and σ₁² = σ₂² = 25 and the examples in Table 4 with (μ₁, μ₂) = (12, 2) and σ₁² = σ₂² = 16 have the highest difference in expected gains in favor of the AE design. One may also argue from the values of P(R₁) and P(R₃) in Tables 2 and 3 that the AE design shows greater selectivity and is less likely to conclude the new treatment is beneficial to the full population when the treatment effect in 𝒫₂ is small or absent altogether.

Adjusting other design parameters
When planning an enrichment trial it is natural to investigate all design parameters and, where possible, optimise their values. Here we consider the timing of the interim analysis at which the decision to enrich may be taken, but we note that a similar approach can be taken in setting other design features. Suppose, with the problem formulation described above, we wish to find the best value of t when the prior distribution of (θ₁, θ₂) is given by μ₁ = 12, μ₂ = 4, σ₁² = σ₂² = 25 and ρ = 0.75. We have applied our methods to find the Bayes optimal design for different values of t. Here we used weights w₁ = √t and w₂ = √(1 − t) in the combination test to account for the different sample sizes before and after the interim analysis. Table 5 shows properties of designs with values of t ranging from 0.1 to 0.9. We see that our earlier choice of t = 0.5 yields the highest expected gain of 6.91, but designs with t between 0.3 and 0.6 are very close to this optimum. As t increases from 0.1 to 0.7, the probability of enriching the trial increases. This is in keeping with the information in Table 3 that the FF design is superior to the FS design, so a certain amount of data is needed to show that enrichment is the better option in a particular trial. We have seen similar results in other examples where the FF design is superior to the FS design: AE designs with a range of t values perform well, as long as t is high enough to give enough information to make an informed decision about enrichment.

TABLE 3 Properties of fixed subpopulation (FS), fixed full population (FF), and optimal adaptive enrichment (AE) designs when θ has the prior distribution given by (13). Here P(R₁) is the probability that only H₀₁ is rejected and P(R₃) the probability that H₀₃ is rejected.

A somewhat different pattern is seen in scenarios where the FS design gives a high expected gain. Suppose the prior distribution for (θ₁, θ₂) has μ₁ = 12, μ₂ = 2, σ₁² = σ₂² = 4 and ρ = 0.75.
We saw in Table 4 that the FS design has higher expected gain than both the FF design and the optimal AE design with = 0.5. Table 6 shows results for optimal AE designs with different values of .
Since we have used weights w₁ = √t and w₂ = √(1 − t) in the combination test, as t decreases toward zero the analysis after enrichment becomes identical to that of the FS design. This explains why the probability of enrichment is high for small values of t and the expected gain is very close to that of the FS design. In fact, the optimal AE designs with t = 0.1, 0.2, and 0.3 have marginally higher expected gain than the FS design. Thus, an adaptive design with an early interim analysis could be a suitable choice if investigators are reluctant to restrict attention to subpopulation 𝒫₁ from the outset.
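The role of the weights can be seen directly in the combination statistic. A minimal sketch of the weighted inverse normal combination test (with an unadjusted critical value, ignoring the multiplicity adjustment of the closed testing procedure):

```python
from statistics import NormalDist

def combination_reject(z1, z2, t, alpha=0.025):
    """Weighted inverse normal combination test: reject if
    w1*z1 + w2*z2 > Phi^{-1}(1 - alpha), with w1 = sqrt(t), w2 = sqrt(1 - t),
    where z1, z2 are the independent stage-1 and stage-2 z-statistics."""
    w1, w2 = t ** 0.5, (1 - t) ** 0.5
    return w1 * z1 + w2 * z2 > NormalDist().inv_cdf(1 - alpha)

# As t -> 0 the stage-1 data get negligible weight, so the overall test
# essentially reduces to the stage-2 (post-enrichment) analysis.
print(combination_reject(z1=0.0, z2=2.5, t=0.01))   # True: stage 2 dominates
print(combination_reject(z1=0.0, z2=2.5, t=0.99))   # False: stage 1 dominates
```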

TABLE 4 Properties of fixed subpopulation (FS), fixed full population (FF), and optimal adaptive enrichment (AE) designs when θ has the prior distribution given by (13).

TABLE 7 Properties of fixed subpopulation (FS), fixed full population (FF), and optimal adaptive enrichment (AE) designs for different subpopulation sizes when θ has the prior distribution given by (13) with μ₁ = 14, μ₂ = 2, σ₁² = σ₂² = 25, and ρ = 0.75. The subpopulation 𝒫₂ represents a fraction 1 − λ of the total population.

Effect of the subpopulation size
In all of our examples so far, the subpopulation 𝒫₁ has represented half of the total population. The size of the specified subpopulation is a feature of the study and not a parameter that can be controlled. Table 7 shows the effect of the subpopulation size on the relative performance of different designs. In this example, the prior distribution for (θ₁, θ₂) has μ₁ = 14, μ₂ = 2, σ₁² = σ₂² = 25 and ρ = 0.75, and we saw in Table 3 that the optimal AE design is the best option when λ = 0.5. The results in Table 7 show that the optimal AE design remains superior to both the FF and FS designs across the whole range of λ values from 0.1 to 0.9.
For each design, the expected gain increases with λ as the fraction of the population in which the treatment effect is θ₁ becomes larger. The margin of superiority of the AE design over the FF design is largest for λ = 0.2 and λ = 0.3. The reasons behind this are quite complex. The potential benefits of adaptive enrichment are small when λ is close to zero or one and one of the subpopulations forms a large fraction of the total population. Also, the interim estimate of θ₁ has a high variance when λ is small and the estimate of θ₂ has a high variance when λ is large, reducing the information available when making the interim decision. Nevertheless, it is clear from this example that adaptive enrichment can be of benefit over a wide range of subpopulation sizes.
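The dependence of the interim SDs on λ can be made concrete. Assuming, as before, the two-arm formula SD = √(4σ²/m) for an effect estimate based on m patients (our assumption, consistent with the SD of 6.15 at λ = 0.5):

```python
from math import sqrt

def interim_sds(lam, n=264, t=0.5, sigma=25.0):
    """SDs of the interim estimates of theta_1 and theta_2 when subpopulation
    P1 forms a fraction lam of the population (two-arm formula 4*sigma^2/m)."""
    m1, m2 = n * t * lam, n * t * (1 - lam)   # interim patients in P1 and P2
    return sqrt(4 * sigma**2 / m1), sqrt(4 * sigma**2 / m2)

for lam in (0.1, 0.3, 0.5, 0.7, 0.9):
    sd1, sd2 = interim_sds(lam)
    print(lam, round(sd1, 2), round(sd2, 2))
```

At λ = 0.1 the estimate of θ₁ is based on very few patients and its SD is more than double the λ = 0.5 value of 6.15, illustrating why the interim decision is poorly informed near the extremes.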

DISCUSSION
We have considered adaptive trial designs for testing the efficacy of a new treatment when a prespecified subpopulation is deemed particularly likely to benefit from the new treatment. The methods we have presented facilitate calculation of the Bayes optimal rule for deciding whether to enrich in a design where the familywise type I error rate is controlled by a closed testing procedure and combination test. Since this calculation relies on Monte Carlo simulation to determine the optimum decision at all possible values of (θ̂₁^(1), θ̂₂^(1)), efficient calculation is crucial. We achieve this by use of an algorithm that makes intensive computations along a one-dimensional strip of (θ̂₁^(1), θ̂₂^(1)) values, rather than on a fully two-dimensional grid. The use of simulation means that this approach is highly flexible and may be applied just as easily with other forms of closed testing procedure or combination test, or with different definitions of the final gain function.
Our study of a wide range of examples supports clear conclusions about the benefits of adaptive enrichment designs. If investigators are willing to use either the FF (fixed full population) or FS (fixed subpopulation) design, the additional benefits of an adaptive enrichment design are at best modest for the gain function we have considered. However, the FS design may not be a realistic option: there could be differing opinions about the likely treatment effect in the subpopulation 𝒫₂ or, within the wider development program, there may be good reasons for wanting to learn about the new treatment's efficacy in the full population. Then, if the FS design is not an option, there are plausible prior distributions for θ under which the AE design is clearly superior to the FF design.
A positive feature of the AE design that is not captured in our gain function is its selectivity. Suppose θ₁ is high but θ₂ is close to zero. If rejection of H₀₃: θ₃ ≤ 0 leads to the new treatment being made available to the full patient population, it would be given to patients in 𝒫₂ for whom the control treatment is just as good. If θ₂ = 0, the term θ₃I₃ in the gain function (8), where I₃ is the indicator that H₀₃ is rejected, is equal to λθ₁I₃ and this neither rewards nor penalizes giving the new treatment to patients in 𝒫₂. The results in Tables 2 and 3 show the AE design to have higher values of P(R₁) and lower values of P(R₃) compared to the FF design, indicating that when θ₂ is low the AE design is more likely to find a treatment effect only in 𝒫₁.
Our results have illustrated a general weakness of adaptive designs: decisions about adaptation are based on interim data which provide only limited information about the true treatment effects. The results in Table 2 for the FS and FF designs show clear benefits to drawing patients from the most appropriate subgroups when the value of θ is known. However, in the examples of Table 3 and the examples with higher prior variances in Table 4, the AE designs must make enrichment decisions under highly variable posterior distributions of θ at the interim analysis. A possible remedy to this problem is to use additional information from other endpoints or biomarkers that can be assumed to respond in the same way as the primary endpoint to the treatments under investigation.
We have presented methods for a study in which there is just one subpopulation of special interest. These methods can be generalized to the design of trials with multiple subpopulations, possibly nested with the treatment effect increasing as the size of the subpopulation decreases. Then, given a multiple testing procedure that controls FWER, a suitably defined gain function and a prior distribution for the vector of treatment effects, our simulation-based approach may be used to find the optimal enrichment decision at an interim analysis. However, more computation will be needed to find the full optimal design as the dimensionality of the problem increases with the number of subpopulations.
The gain function (8) may be adapted to reflect the process of drug approval. Suppose, for example, H 03 : 3 ≤ 0 is rejected on the strength of a large positive estimate of 1 and a much smaller estimate for 2 . While a regulator may not require formal rejection of the null hypothesis H 02 : 2 ≤ 0 at the 0.025 significance level, some minimum threshold for an estimatê2 may be required in order for the treatment to be approved for the full population, and for health care providers to agree to pay for this treatment. Such a requirement can be reflected in the gain function G( , X), where the data in X includes estimates of 1 and 2 . Rather than stipulate a particular gain function for all applications, we recommend that investigators determine the appropriate gain function for their specific trial, then our methods can be used to optimize over adaptive enrichment designs and to compare the resulting design with other, nonadaptive options.

SUPPORTING INFORMATION
Additional supporting information may be found online in the Supporting Information section at the end of this article.

APPENDIX A. STRONG CONTROL OF FWER IMPLIES A CLOSED TESTING PROCEDURE
Suppose a multiple testing procedure 𝒟 with n null hypotheses provides strong control of the FWER at level α. We shall show that 𝒟 can be represented as a closed testing procedure 𝒞. Suppose the null hypotheses are stated in terms of a parameter vector θ; then strong control of the FWER implies that

Pθ(Reject at least one true null hypothesis) ≤ α for all θ. (A1)
Suppose the ith null hypothesis is H₀ᵢ: θ ∈ Aᵢ. Denote the observed data by X and suppose 𝒟 rejects H₀ᵢ if X ∈ ℛᵢ. We shall use the rejection regions ℛᵢ to define a closed testing procedure 𝒞 which gives the same overall decisions as 𝒟.
We first define level α tests of the individual hypotheses H₀₁, … , H₀ₙ. For each i ∈ {1, … , n}, the test of H₀ᵢ rejects its null hypothesis if and only if X ∈ ℛᵢ. To see that this gives a level α test of H₀ᵢ, suppose θ ∈ Aᵢ; then Pθ(Reject H₀ᵢ) = Pθ(X ∈ ℛᵢ) ≤ Pθ(Reject at least one true null hypothesis) ≤ α, by applying (A1) with θ ∈ Aᵢ. Now consider an intersection hypothesis H_I = ∩_{i∈I} H₀ᵢ, where I is a subset of {1, … , n}. Our level α test of H_I rejects H_I if X ∈ ∪_{i∈I} ℛᵢ.
To see that this gives a level α test of H_I, suppose H_I is true, so θ ∈ ∩_{i∈I} Aᵢ; then Pθ(Reject H_I) = Pθ(X ∈ ∪_{i∈I} ℛᵢ) ≤ Pθ(Reject at least one true null hypothesis) ≤ α, by applying (A1) with θ ∈ ∩_{i∈I} Aᵢ. The closed testing procedure 𝒞 is formed by combining the level α tests of individual and intersection hypotheses in the usual way. Thus, the null hypothesis H₀ᵢ is rejected overall if the level α tests reject H₀ᵢ and every H_I for which i ∈ I. It is easy to check that the procedure 𝒞 rejects H₀ᵢ overall if and only if X ∈ ℛᵢ, and thus the two procedures 𝒟 and 𝒞 always reject exactly the same set of hypotheses.
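The construction can be written out directly. In this sketch the rejection regions are arbitrary illustrative predicates on a scalar statistic, not those of any particular trial design:

```python
from itertools import combinations

def closed_test(x, regions):
    """Apply the closed testing procedure C built in Appendix A from the
    rejection regions R_i (predicates on the data x) of a procedure D with
    strong FWER control: H_I is rejected iff x lies in the union of R_i over
    i in I, and H_0i is rejected overall iff every H_I with i in I is rejected."""
    n = len(regions)
    rejected = []
    for i in range(n):
        # Every union over a set I containing i includes R_i itself, so the
        # condition reduces to x in R_i: C reproduces the decisions of D.
        if all(any(regions[j](x) for j in I)
               for k in range(1, n + 1)
               for I in combinations(range(n), k) if i in I):
            rejected.append(i)
    return rejected

# Illustrative regions: reject H_01 if x > 2, H_02 if x > 3.
regions = [lambda x: x > 2, lambda x: x > 3]
print(closed_test(2.5, regions))   # [0]
print(closed_test(3.5, regions))   # [0, 1]
```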
Although the above construction is quite simple, we are not aware that this result has been noted previously. An implication in our application is that we lose no generality by restricting attention to methods based on closed testing procedures. Of course, the choice of closed testing procedure remains. In our case, it is natural to base the level α test of H₀₁ on θ̂₁^(1) and θ̂₁^(2) and the level α test of H₀₃ on θ̂₃^(1) and θ̂₃^(2), so we see it is the method of testing the intersection hypothesis H₀₁ ∩ H₀₃ that may merit further investigation.

APPENDIX B. DERIVATION OF THE OPTIMAL DECISION RULE
We illustrate the details of our computational method in an example where the decision rule being sought is that depicted in Figure 3A. In finding this rule, we start by defining a region A in which (θ̂₁^(1), θ̂₂^(1)) will lie with very high probability: in this example we have taken A to be the square (0, 20) × (−10, 10). We subdivide A into four smaller squares and find the optimal decision at each of the nine vertices of these squares, giving the results shown in Figure 3B. We proceed on the assumption that if a certain decision is optimal at (θ̂₁^(1), θ̂₂^(1)) = (a, b) and at (θ̂₁^(1), θ̂₂^(1)) = (c, b), where a < c, then this decision is also optimal at (θ̂₁^(1), θ̂₂^(1)) = (d, b) for all a < d < c, and similarly along vertical segments. Applying this assumption in our example, we see that it is optimal to enrich for all values of (θ̂₁^(1), θ̂₂^(1)) in the top right-hand square, so we record this conclusion and make no further calculations for points in this square. The other three squares need further work: we subdivide each of these into four smaller squares and find the optimal decision at each new vertex. The results of these steps are presented in Figure 3C.
We continue the search iteratively, halving the size of the smallest squares at each step. In the next iteration for our example, we note that five of the 12 small squares in Figure 3C have the same optimal decision at all four vertices and we allocate this decision to the whole square. We subdivide the other seven squares and compute optimal decisions at the new vertices. The information after this step is depicted in Figure 3D. Repeating the same steps in the next iteration produces the results shown in Figure 4A.
If our target is to specify optimal decisions on a 16 × 16 grid of squares, this is the final iteration. To complete the process, we find the optimal decision associated with each of the smallest squares: if the optimal decision is the same at all four vertices, this decision is assigned to the square; if not, we find the optimal decision at the square's center point and define this to be the decision for the whole square. Figure 4B shows the results of this last step, while Figure 4C presents the same set of conclusions using the full 16 × 16 grid.
Analysing this algorithm in the most demanding case, when the decision boundary is at an angle of 45°, we find the optimal decision has to be computed at about 14n points in order to determine optimal decisions on an array of n × n small squares. A key point here is that the amount of computation is of order n, even though there are n² small squares at the finest level. Since we need to conduct a large number of simulations in finding the optimal decision for each value of (θ̂₁^(1), θ̂₂^(1)), the computational load can still be high, but it is feasible. In our examples we found optimal decisions on a 2⁸ × 2⁸ or 2⁹ × 2⁹ array, using samples of size 10⁵ or 10⁶ from the posterior distribution of θ in finding the optimal decision at each value of (θ̂₁^(1), θ̂₂^(1)).
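The subdivision scheme can be sketched as follows. Here the `decision` oracle is a cheap stand-in for the simulation-based optimisation performed at each point, with a 45-degree boundary, the most demanding case; the printout compares the number of oracle evaluations with the n² grid points:

```python
def refine_count(decision, depth):
    """Assign an optimal decision to every square of a 2^depth x 2^depth array
    over the unit square, using the subdivision scheme of Appendix B, and
    return the number of distinct points at which `decision` was evaluated."""
    cache = {}
    def d(x, y):
        if (x, y) not in cache:
            cache[(x, y)] = decision(x, y)
        return cache[(x, y)]

    active = [(0.0, 0.0, 1.0)]            # squares as (corner_x, corner_y, side)
    for _ in range(depth):
        nxt = []
        for x, y, s in active:
            if len({d(x, y), d(x + s, y), d(x, y + s), d(x + s, y + s)}) == 1:
                continue                  # one decision at all four corners: freeze
            h = s / 2                     # mixed corners: subdivide into four
            nxt += [(x, y, h), (x + h, y, h), (x, y + h, h), (x + h, y + h, h)]
        active = nxt
    for x, y, s in active:                # finest level: settle by the centre point
        if len({d(x, y), d(x + s, y), d(x, y + s), d(x + s, y + s)}) > 1:
            d(x + s / 2, y + s / 2)
    return len(cache)

# 45-degree boundary (the most demanding case): "enrich" iff x + y > 1.
boundary = lambda x, y: x + y > 1.0
for depth in (4, 6, 8):
    n = 2 ** depth
    print(n, refine_count(boundary, depth), n * n)
```

The evaluation count grows roughly linearly in n while the grid size grows as n², which is the saving the appendix describes.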
In the examples we have studied, it has usually been clear from the results that the optimal decision function has the assumed monotonicity property. However, it is possible for this assumption to fail. In that case, the decision boundary may cross one edge of a square twice, then having the same optimal decision at all four vertices of that square does not necessarily mean this decision is optimal throughout the square. In a more conservative version of our algorithm, which guards against this eventuality, we require the same decision to be optimal at all 16 vertices of a 3 × 3 grid of squares before concluding this decision to be optimal over the whole of the central square. The additional computations needed when