Optimizing subgroup selection in two‐stage adaptive enrichment and umbrella designs

We design two‐stage confirmatory clinical trials that use adaptation to find the subgroup of patients who will benefit from a new treatment, testing for a treatment effect in each of two disjoint subgroups. Our proposal allows aspects of the trial, such as recruitment probabilities of each group, to be altered at an interim analysis. We use the conditional error rate approach to implement these adaptations with protection of overall error rates. Applying a Bayesian decision‐theoretic framework, we optimize design parameters by maximizing a utility function that takes the population prevalence of the subgroups into account. We show results for traditional trials with familywise error rate control (using a closed testing procedure) as well as for umbrella trials in which only the per‐comparison type 1 error rate is controlled. We present numerical examples to illustrate the optimization process and the effectiveness of the proposed designs.

the trial being conducted with subjects from the remaining groups only. The U.S. Food and Drug Administration guidance on adaptive designs highlights the use of adaptive enrichment designs as a means to increase the chance of detecting a true drug effect over that of a fixed sample design. 12 Master protocols provide an infrastructure for the efficient study of newly developed compounds or biomarker-defined subgroups. 13,14 Such studies simultaneously evaluate more than one investigational drug or more than one disease type within the same overall trial structure. 15-17 An umbrella trial is a particular type of master protocol in which enrolment is restricted to a single disease but the patients are screened and assigned to molecularly defined subtrials. Each subtrial may have different objectives, endpoints or design characteristics. An example of an umbrella trial is the ALCHEMIST trial, in which patients with non-small cell lung cancer are screened for EGFR mutation or ALK rearrangement and assigned accordingly to subtrials with different treatments. 18 In this paper, we study confirmatory trials that allow the investigation of the treatment effect in prespecified nonoverlapping subgroups. In particular, we focus on adaptive clinical trials that allow the modification of design elements without compromising the integrity of the trial. 19 We propose a class of adaptive enrichment designs that use a Bayesian decision framework to optimize the design parameters, such as the trial prevalences of the subgroups, the weights for multiple hypothesis testing, and adaptation rules. A similar framework has been used in References 20-27 for adaptive enrichment trials.
We consider two types of problem. In the first case, we study designs that preserve the familywise error rate (FWER) of the trial using a closed testing procedure to test the null hypotheses of no treatment effect in the two subgroups. This is what is typically required in adaptive enrichment trials where a single treatment is evaluated against a control. In the second case, we show results for umbrella trial designs without multiplicity adjustment. Here, we consider studies made up of separate simultaneous trials, for which it has been argued that no control of multiplicity is needed. 28 Our work, therefore, provides an overarching framework for both adaptive enrichment designs and umbrella trials.
The manuscript is organized as follows: In Section 2, we introduce the designs and distinguish between single-stage designs (Section 2.2) and two-stage designs (Section 2.3), and in Section 2.4 we discuss how to adapt our proposed designs to umbrella trials. In Sections 3 and 4 we present numerical examples. We describe how our methods may be extended to designs with more than two stages in Section 5 and we end with conclusions and a discussion in Section 6.

The class of trial designs
Consider a confirmatory parallel-group clinical trial comparing a new treatment and a control with respect to a pre-defined primary endpoint. We assume the patient population may be divided into disjoint, biomarker-defined subgroups. Given a maximum achievable sample size, n, we aim to optimize the trial design by maximising a specified utility function. Suppose two biomarker-defined subgroups have been identified before commencing the trial. Let 0 < λ < 1 be the prevalence of the first subgroup in the underlying patient population and 1 − λ the prevalence of the second subgroup. Let θ1 and θ2 be the treatment effects, denoting the difference in the mean outcome between treatment and control, in the first and second subgroups, respectively. We consider trials to investigate the null hypotheses H01: θ1 ≤ 0 and H02: θ2 ≤ 0 with corresponding alternative hypotheses H11: θ1 > 0 and H12: θ2 > 0. In Sections 2.2 and 2.3 we consider confirmatory trials in which strong control of the FWER is imposed. 29 In our discussion of umbrella trials in Section 2.4, we assume multiplicity control is not required.
We consider optimization within a class of designs 𝒜 that have a single interim analysis at which adaptation can take place. The total sample size is fixed at n, with s^(1) n patients in the first stage and s^(2) n patients in the second stage, where s^(1) > 0, s^(2) ≥ 0 and s^(1) + s^(2) = 1. In the first stage, r1^(1) s^(1) n patients are recruited from subgroup 1 and r2^(1) s^(1) n from subgroup 2, where r1^(1) ≥ 0, r2^(1) ≥ 0 and r1^(1) + r2^(1) = 1. In the second stage, r1^(2) s^(2) n patients are recruited from subgroup 1 and r2^(2) s^(2) n from subgroup 2, where r1^(2) ≥ 0, r2^(2) ≥ 0 and r1^(2) + r2^(2) = 1, and the values of r1^(2) and r2^(2) may depend on the first-stage data. Within each stage and subgroup, we assume equal allocation to the two treatment arms (this assumption is not strictly necessary and could be relaxed). Figure 1 gives a schematic representation of the trial design.
The definition of a particular design in 𝒜 is completed by specifying the multiple testing procedure to be used and the method for combining data across stages when adaptation occurs. We use a closed testing procedure to control the FWER, applying a weighted Bonferroni procedure to test the intersection hypothesis. In this procedure, weights are initially set as w1^(1) and w2^(1) but these may be modified in the second stage if adaptation occurs. The error rate for each hypothesis test is controlled by preserving the conditional type I error rate when an adaptation is made. Thus, while we use a Bayesian approach to optimize the design, the trial is analyzed using frequentist procedures that control error rates at the desired level, adhering to conventional regulatory standards.

FIGURE 1  Schematic representation of the three types of trial design. In the single-stage trial, the sampling prevalences of the subgroups are fixed throughout the trial. In standard adaptive enrichment trials, patients are recruited with predefined subgroup prevalences until the interim analysis, at which point a decision is taken to continue with the same prevalences or to sample from a single subgroup. In the Bayes optimal adaptive trial designs that we consider, the sampling prevalences may be changed at the interim analysis
We follow a Bayesian decision-theoretic approach to optimize over trial designs in the class 𝒜. In assessing each design, we assume a prior distribution for the treatment effects in each subgroup and a utility function 30 that quantifies the value of the trial's outcome. We shall optimize designs with respect to the timing of the interim analysis, the proportion of patients recruited from the two subgroups at each stage of the trial, the weights in the weighted Bonferroni test, and the rule for updating these weights given the interim data.
We summarize the data observed during the trial by D̂, noting that this summary should contain information about the numbers of observations from each subgroup and the weights to be used in the weighted Bonferroni test at each stage, as well as estimates of θ1 and θ2 obtained from observations before and after the interim analysis. We define our utility function to be

U(D̂) = λ 1(Reject H01) + (1 − λ) 1(Reject H02),   (1)

where 1(.) is the indicator function. By definition, the data summary D̂ contains the information needed to determine whether each of the hypotheses H01 and H02 is rejected. The utility (1) involves the size of the underlying subgroups as well as the rejection of the corresponding hypotheses. Thus, rejection of the null hypothesis for a larger subgroup is given greater weight. If the population prevalence of the two subgroups is not known, a prior on λ may be added. We note that terms in the function (1) are positive when a null hypothesis is rejected even if the associated treatment effect is very small or negative: this issue could be addressed by multiplying each term by an indicator variable which takes the value 1 if the relevant parameter, θ1 or θ2, is larger than zero or above a clinically relevant threshold (eg, Stallard et al 31 where a similar approach is used for treatment selection).
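The utility in Equation (1) can be sketched in a few lines of code; the function name and arguments below are illustrative, not part of the formal development.

```python
def utility(lam, reject_h01, reject_h02):
    """Utility of Equation (1): lam is the population prevalence of
    subgroup 1; each rejected hypothesis contributes its subgroup's
    prevalence, so rejections in larger subgroups are worth more."""
    return lam * float(reject_h01) + (1.0 - lam) * float(reject_h02)
```

With λ = 0.3, rejecting only H01 yields utility 0.3, rejecting only H02 yields 0.7, and rejecting both yields 1.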
Since the trial design is optimized with respect to the stated utility, it is important to choose a utility function that reflects accurately the relative importance of possible trial outcomes. Furthermore, the definition of utility can be adapted to reflect the interest of different stakeholders, for example, Ondra et al 21 and Graf,Posch and König 24 propose utility functions that represent the view of a sponsor or take a public health perspective.
Let π(θ) denote the prior distribution for θ = (θ1, θ2). Then, the Bayes expected utility for a trial design a ∈ 𝒜 is

W(a) = ∫ ∫ U(D̂) f(D̂ | θ, a) dD̂ π(θ) dθ,

where we have taken the expectation over the sampling distribution of the trial data given the true treatment effects θ, with an outer integral over the prior distribution π(θ). When choosing the prior π(θ), it is important to remember that W(a) represents the expected utility, averaged over θ ∼ π(θ). If an "uninformative" prior is chosen, this will place weight on extreme scenarios, such as large negative treatment effects, which have little credibility. Thus, when considering the Bayes optimal design, it is important to use subjective, informative priors. In some cases, pilot studies or historic observational data may be available to construct the prior distribution.
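In practice, W(a) can be approximated by Monte Carlo: draw θ from the prior, simulate the trial, and average the realized utilities. A minimal sketch, assuming a user-supplied run_trial function (a stand-in for the trial simulation, not part of the paper's formal machinery) that simulates the trial under θ and returns the realized utility:

```python
import numpy as np

rng = np.random.default_rng(1)

def expected_utility(run_trial, mu, cov, n_sims=2000):
    """Monte Carlo estimate of the Bayes expected utility W(a): average the
    realized utility over theta ~ N(mu, cov) (the prior) and over the
    sampling distribution of the data (handled inside run_trial)."""
    thetas = rng.multivariate_normal(mu, cov, size=n_sims)
    return float(np.mean([run_trial(theta) for theta in thetas]))
```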
In this paper, we assume the prior distribution π(θ) to be bivariate normal,

θ = (θ1, θ2) ∼ N( (μ1, μ2), ( ν1²  ρν1ν2 ; ρν1ν2  ν2² ) ).   (2)

Here, the correlation coefficient ρ reflects the belief about the existence of common factors that contribute to the treatment effects in the two subgroups.

2.2 Bayes optimal single-stage design

Patient recruitment and estimation
Suppose we wish to conduct a single-stage trial, which is the special case where s^(2) = 0, usually referred to as a stratified design. For simplicity of notation in this section, we write r_j and w_j rather than r_j^(1) and w_j^(1) for j = 1 and 2. We assume patients can be recruited with trial prevalences r1 and r2 for the two subgroups, regardless of the true proportions λ and 1 − λ in the underlying patient population. In addition, we assume that patients are randomised between the new treatment and the control with a 1:1 allocation ratio in each subgroup.
During the trial we observe a normally distributed endpoint for each patient and we assume a constant variance for all observations. For patient i from subgroup j on the new treatment we have X_ji ∼ N(μ_Tj, σ²), i = 1, … , r_j n/2, and for patient i from subgroup j on the control treatment we have Y_ji ∼ N(μ_Cj, σ²), i = 1, … , r_j n/2. The estimate of the treatment effect θ_j = μ_Tj − μ_Cj is the difference in sample means,

θ̂_j = X̄_j − Ȳ_j ∼ N(θ_j, 4σ²/(r_j n)).   (3)
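As a check on Equation (3), the stratified single-stage trial is easy to simulate directly. The helper below uses illustrative names and, for simplicity, sets the control means to zero so that θ_j equals the treatment mean; it returns the per-subgroup estimates θ̂_j.

```python
import numpy as np

rng = np.random.default_rng(2)

def estimate_effects(theta, r, n, sigma=1.0):
    """Simulate one stratified trial: within subgroup j, r[j]*n patients are
    split 1:1 between treatment and control, and theta_hat_j is the
    difference in sample means, distributed N(theta_j, 4*sigma^2/(r_j*n))."""
    est = []
    for j in range(2):
        m = int(r[j] * n / 2)                 # patients per arm in subgroup j
        x = rng.normal(theta[j], sigma, m)    # treatment responses X_ji
        y = rng.normal(0.0, sigma, m)         # control responses Y_ji
        est.append(x.mean() - y.mean())
    return est
```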

Hypothesis testing in the single-stage design
Consider the case s^(2) = 0 and 0 < r1 < 1. Then θ̂1 and θ̂2 are independent with the distributions given in Equation (3). We apply a weighted Bonferroni test of the intersection hypothesis H01 ∩ H02, rejecting if the p-value for either elementary hypothesis satisfies p1 ≤ w1 α or p2 ≤ w2 α, and we then reject the individual hypothesis H0j in the closed testing procedure if, in addition, p_j ≤ α. The resulting closed testing procedure is equivalent to the weighted Bonferroni-Holm test and will be generalised to adaptive tests in Section 2.3.
We note that the choice of a closed testing procedure is not restrictive in this setting since any procedure that gives strong control of the FWER may be written as a closed testing procedure. 22,23 Furthermore, in the special cases r1 = 1 and r2 = 1, where the trial recruits from only one of the subgroups, just one subgroup is tested and only the test of the individual hypothesis is required. These cases are accommodated in our general class of designs by setting w1 = 1 when r1 = 1 and w2 = 1 when r2 = 1.
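The closed testing procedure described above reduces to a simple rule on the two p-values; a sketch (function name illustrative):

```python
def closed_test(p1, p2, w1, alpha=0.05):
    """Closed test with a weighted Bonferroni test of H01 ∩ H02: reject the
    intersection if p1 <= w1*alpha or p2 <= (1 - w1)*alpha, then reject the
    individual hypothesis H0j if additionally pj <= alpha.  Equivalent to
    the weighted Bonferroni-Holm procedure."""
    reject_int = (p1 <= w1 * alpha) or (p2 <= (1.0 - w1) * alpha)
    return (reject_int and p1 <= alpha, reject_int and p2 <= alpha)
```

For example, with w1 = 0.5 and α = 0.05, p-values (0.02, 0.03) reject both hypotheses, while (0.04, 0.03) reject neither, since neither p-value clears its Bonferroni threshold of 0.025.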

Bayesian optimization
In the single-stage trial we wish to optimize the trial prevalences of each subgroup, r1 and r2, and the weights in the Bonferroni-Holm procedure, w1 and w2. Given the constraints r1 + r2 = 1 and w1 + w2 = 1, we denote the set of parameters to optimize by a = (r1, w1). Let f(θ̂ | θ, a) denote the conditional distribution of (θ̂1, θ̂2) given θ for design parameters a. The Bayes expected utility is given by

W(a) = ∫ ∫ U(D̂) f(θ̂ | θ, a) dθ̂ π(θ) dθ.

The Bayes optimal design is given by the pair a = (r1, w1) that maximises the Bayes expected utility of the trial, that is

a* = arg max_a W(a).

Given our simple choices for the prior distribution and the utility function, this integral may be computed directly (see Section S1.2 of Appendix S1). We find the Bayes optimal single-stage trial by a numerical search over possible values of a.
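The numerical search over a = (r1, w1) can be as simple as a grid search. The sketch below assumes a user-supplied function returning the Bayes expected utility W(a), computed directly (as in Section S1.2) or by Monte Carlo:

```python
import numpy as np
from itertools import product

def optimize_single_stage(expected_utility, grid=np.linspace(0.05, 0.95, 19)):
    """Grid search for the Bayes optimal single-stage design: evaluate
    expected_utility(r1, w1) on a grid over (0,1) x (0,1) and return the
    maximizing pair (r1, w1)."""
    return max(product(grid, grid), key=lambda a: expected_utility(*a))
```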

2.3 Bayes optimal two-stage adaptive design

Adding a second stage
Consider now a two-stage design in which data from the first stage inform adaptations in the second stage. The estimate of θ_j for subgroup j based on data collected in stage k is

θ̂_j^(k) = X̄_j^(k) − Ȳ_j^(k),   (4)

where X̄_j^(k) and Ȳ_j^(k) are the mean responses in subgroup j in stage k for the treatment arm and control arm, respectively. Given the value of θ = (θ1, θ2), the first-stage estimates are independent with distributions

θ̂_j^(1) ∼ N(θ_j, 4σ²/(r_j^(1) s^(1) n)),  j = 1, 2.

The trial prevalences, r1^(2) and r2^(2), of the two subgroups in the second stage depend on θ̂1^(1) and θ̂2^(1) but, conditional on r1^(2) and r2^(2), the second-stage estimates are independent and conditionally independent of θ̂^(1).

Hypothesis testing in the two-stage adaptive design
There is a variety of approaches to testing multiple hypotheses in a two-stage adaptive design. 33-36 We shall use a closed testing procedure to ensure strong control of the FWER at level α, as we did for the single-stage design in Section 2.2.2. In constructing level-α tests of the null hypotheses H01, H02 and H01 ∩ H02 we employ the conditional error rate approach. 37,38 Based on a reference design and its predefined tests, we calculate the conditional error rate for each hypothesis and define adaptive tests which preserve this conditional error rate, thereby controlling the overall type I error rate. Consider a reference design in which the trial prevalences of subgroups 1 and 2 and the weights in the weighted Bonferroni test of H01 ∩ H02 remain the same across stages, so r_j^(2) = r_j^(1) and w_j^(2) = w_j^(1) for j = 1 and 2. In the reference design, tests are performed by pooling the stage-wise data within each subgroup and treatment arm, and using the conventional test statistics, as for the single-stage test. For j = 1 and 2, the pooled estimate of θ_j across the two stages of the trial is θ̂_j = s^(1) θ̂_j^(1) + s^(2) θ̂_j^(2), with standardized statistic Z_j, and the null hypothesis H0j is rejected at level α if Z_j ≥ z_{1−α}. Given the first-stage data, the conditional error rates for the tests of H0j are

A_j = P(Z_j ≥ z_{1−α} | θ̂_j^(1)),  j = 1, 2,   (5)

computed under θ_j = 0. Similarly, the conditional error rate for the test of H01 ∩ H02 is

A12 = P(Z1 ≥ z_{1−w1 α} or Z2 ≥ z_{1−w2 α} | θ̂1^(1), θ̂2^(1)),   (6)

again computed under the null. See Section S1.1 of Appendix S1 for further details on the derivations of the conditional distributions.
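For a single pooled z-test the conditional error rate has a closed form. The sketch below assumes a common information fraction t = s^(1) for both subgroups (as holds in the reference design, where prevalences are unchanged across stages), so that the pooled statistic is √t Z^(1) + √(1−t) Z^(2); the independence of the two subgroups then gives the intersection rate.

```python
from math import sqrt
from statistics import NormalDist

_Phi = NormalDist().cdf
_z = NormalDist().inv_cdf

def cond_error(z1, t, level):
    """Conditional error of a pooled z-test at the given level: with pooled
    statistic sqrt(t)*Z1 + sqrt(1-t)*Z2 and observed stage-1 value z1,
    A = P(reject | z1) = 1 - Phi((z_{1-level} - sqrt(t)*z1) / sqrt(1-t))."""
    return 1.0 - _Phi((_z(1.0 - level) - sqrt(t) * z1) / sqrt(1.0 - t))

def cond_errors(z1, z2, t, w1, alpha=0.05):
    """A1 and A2 for the elementary level-alpha tests and A12 for the
    weighted Bonferroni intersection test, using the independence of the
    statistics from the two subgroups."""
    a1 = cond_error(z1, t, alpha)
    a2 = cond_error(z2, t, alpha)
    a12 = 1.0 - (1.0 - cond_error(z1, t, w1 * alpha)) * \
                (1.0 - cond_error(z2, t, (1.0 - w1) * alpha))
    return a1, a2, a12
```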
In the adaptive design, if no adaptations are made at the interim analysis we apply the tests as defined for the reference design. Suppose now that adaptations are made and the trial prevalences in stage 2 are set to be r1^(2) and r2^(2), with weights w1^(2) and w2^(2) for the weighted Bonferroni test. In this case, we calculate the conditional error rates A1, A2 and A12 prior to adaptation from Equations (5) and (6). We then define tests of H01, H02 and H01 ∩ H02 based on stage 2 data alone that have these conditional error rates as their type 1 error probabilities. Given the updated r1^(2) and r2^(2), let Z1^(2) and Z2^(2) denote the standardized test statistics based on the stage 2 data. Thus, in our level-α tests, we reject H0j if Z_j^(2) ≥ z_{1−A_j} and, applying a weighted Bonferroni test with weights w1^(2) and w2^(2), we reject H01 ∩ H02 if Z1^(2) ≥ z_{1−w1^(2) A12} or Z2^(2) ≥ z_{1−w2^(2) A12}.

Two-stage optimization
We denote the set of initial design parameters by a1 = (s^(1), r1^(1), w1^(1)) and the second-stage parameters by a2 = (r1^(2), w1^(2)). Let θ̂^(1) = (θ̂1^(1), θ̂2^(1)) and θ̂^(2) = (θ̂1^(2), θ̂2^(2)) be the vectors of estimated treatment effects in each subgroup, based on the first and second-stage data, respectively, as defined in Equation (4). Denote the conditional distributions of the estimated effects in each stage of the trial by f1(θ̂^(1) | θ, a1) and f2(θ̂^(2) | θ, a2) and the posterior distribution of θ given the stage 1 observations by π(θ | θ̂^(1), a1). Then, the Bayes expected utility can be written as

W(a1) = ∫ ∫ f1(θ̂^(1) | θ, a1) π(θ) [ ∫ ∫ U(D̂) f2(θ̂^(2) | θ′, a2) π(θ′ | θ̂^(1), a1) dθ̂^(2) dθ′ ] dθ dθ̂^(1).   (7)

We find the optimal combination of design parameters a1 before stage 1 and a2 before stage 2 using the backward induction principle. First we construct the Bayes optimal a2 for all possible θ̂^(1) and a1. Then we construct the Bayes optimal a1, given that the optimal a2 will be used in the second stage of the trial.
Optimizing the decision at the interim analysis
The inner integral on the right-hand side of Equation (7) can be written as

W2(a2, a1, θ̂^(1)) = ∫ ∫ U(D̂) f2(θ̂^(2) | θ, a2) π(θ | θ̂^(1), a1) dθ̂^(2) dθ.

Thus, given a1 and θ̂^(1), the Bayes optimal decision for the second stage is the choice of a2 that maximises W2(a2, a1, θ̂^(1)). For known values of θ̂^(1) and a1, we can find the conditional error rates A1, A2, and A12 used in hypothesis testing in stage 2, hence we may evaluate U(D̂) for given a1, θ̂^(1), a2, and θ̂^(2). Our choices for the prior distribution and utility function mean that it is quite straightforward to compute W2(a2, a1, θ̂^(1)) for given a1, a2 and θ̂^(1). Thus, we are able to perform a numerical search seeking to find the Bayes optimal a2.

Overall trial optimization
Having found the Bayes optimal parameters a2 for the second stage of the trial as a function of (a1, θ̂^(1)), we determine a1*, the Bayes optimal choice for the initial parameters, as

a1* = arg max_{a1} ∫ ∫ f1(θ̂^(1) | θ, a1) π(θ) W2(a2(a1, θ̂^(1)), a1, θ̂^(1)) dθ dθ̂^(1).

We conduct a search over possible values of a1 to maximize the above integral and find the optimal choice of a1. Computing the integral for a given value of a1 by numerical integration is not straightforward. Instead, we have used Monte Carlo simulation to carry out this calculation for each value of a1.
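The backward-induction search can be organized as two nested loops. A schematic, with simulate_stage1 (draws θ̂^(1) from the prior predictive under a1) and best_a2_value (the conditional expected utility W2, already maximized over a2) supplied by the user; both names are illustrative:

```python
import numpy as np

def optimize_two_stage(a1_grid, simulate_stage1, best_a2_value, n_sims=500):
    """Backward induction with an outer Monte Carlo loop: for each candidate
    a1, draw stage-1 data repeatedly, take the best achievable conditional
    expected utility for each dataset, average, and return the a1 with the
    highest estimated Bayes expected utility."""
    def w(a1):
        return np.mean([best_a2_value(a1, simulate_stage1(a1))
                        for _ in range(n_sims)])
    return max(a1_grid, key=w)
```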

2.4 Bayes optimal umbrella trials
We now consider the case of umbrella trials, where it has been argued that no multiplicity adjustment is required as the hypotheses to be tested concern different experimental treatments targeted to different molecular markers or subgroups. 28

TABLE 1  The scenarios considered in the numerical examples. The term "opt" indicates that parameters were optimized, while "N/A" means the parameters are not applicable. The parameters θ1 and θ2 are either specified by a prior distribution in which ν1 = ν2 = ν or specific values of θ1 and θ2 are given

Since each treatment is assessed separately, an umbrella trial can be viewed as a set of independent trials even though they are run under a single protocol. We consider umbrella trials with two subgroups, as in the previous sections. However, without multiplicity adjustment, the hypothesis testing procedure reduces to testing the elementary hypotheses H01 and H02 each at level α. In applying the conditional error rate approach, only the computation of the conditional error rates A1 and A2 from Equation (5) is required. Then, with Z1^(2) and Z2^(2) denoting the test statistics based on second-stage data only, H0j is rejected if Z_j^(2) ≥ z_{1−A_j}. No test of the intersection hypothesis is performed. Design parameters are optimized with respect to the utility function in Equation (1). To frame the optimization problem in the same way as in the previous sections, the interim decision in a two-stage umbrella trial will optimize only the second-stage subgroup trial prevalences, so a2 = (r1^(2)), while in the first stage we optimize the subgroup trial prevalences and the timing of the interim analysis, so a1 = (s^(1), r1^(1)). In the case of a single-stage umbrella trial, only the subgroup prevalences are optimized, so a = (r1). We have used a normal prior distribution, as defined in Equation (2), in optimizing the design parameters of single-stage and two-stage trials. In the case of two-stage designs, the interim analysis uses the test statistics from the first stage and the prior distribution to perform adaptations, and the final tests are performed using the conditional error rate approach.
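The per-hypothesis test in the two-stage umbrella design is then a one-line rule (sketch; function name illustrative):

```python
from statistics import NormalDist

def umbrella_stage2_reject(z2_j, a_j):
    """Second-stage test preserving the conditional error rate A_j: reject
    H0j if the stage-2 statistic satisfies Z_j^(2) >= z_{1 - A_j}."""
    return z2_j >= NormalDist().inv_cdf(1.0 - a_j)
```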

NUMERICAL EXAMPLES AND COMPARISONS
In this section, we give numerical examples of optimized single-stage and two-stage designs in a range of scenarios. We show results for cases with and without multiplicity correction, referring to these as enrichment and umbrella trials, respectively. Additionally, we illustrate the optimization of the decision rule at the interim analysis. In Table 1, we provide an overview of the scenarios considered and the parameters that are optimized.

Optimal single-stage designs
In studying the impact of the prior distribution on the optimized trial design parameters a = (r1, w1) for single-stage designs, we consider studies where the response variance is σ² = 1 and the total sample size is fixed at n = 700. We assume a multivariate normal prior distribution for θ as defined in Equation (2) with parameters μ1, μ2, ν1 = ν2 = ν and ρ, and we compute optimal designs for a variety of such priors. The FWER in enrichment designs and the per-comparison error rate in umbrella designs is fixed at α = 0.05. In Figure 2 we display the effect of the prior SD ν on the optimal design parameters when the population prevalence of subgroup 1 is λ = 0.3. We considered prior SDs of 0.02, 0.0632, 0.1, 0.1414, 0.2, 0.3162, and 0.4472, corresponding to information from studies with 10 000, 1000, 400, 200, 100, 40 and 20 subjects in each subgroup.
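The quoted prior SDs follow from equating the prior variance to the sampling variance of a treatment-effect estimate from a hypothetical earlier study: with N subjects in a subgroup split 1:1 between arms and σ = 1, the SD of the difference in means is 2σ/√N. A small sketch of this mapping (helper name is ours):

```python
from math import sqrt

def prior_sd(n_per_subgroup, sigma=1.0):
    """SD of a difference in means based on n subjects split 1:1 between
    arms: sqrt(sigma^2/(n/2) + sigma^2/(n/2)) = 2*sigma/sqrt(n)."""
    return 2.0 * sigma / sqrt(n_per_subgroup)
```

For example, prior_sd(100) = 0.2 and prior_sd(400) = 0.1, matching the values listed above.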
The mean and variance of the prior distribution have a large impact on the optimal design parameters r1 and w1. The optimal values of r1 and w1 and the expected utility of the resulting designs are very similar for enrichment and umbrella designs. If μ1 > 0 and μ2 = 0, optimal values of r1 and w1 are larger than 0.3, the population prevalence of subgroup 1, so the design over-samples this subgroup. If μ1 = 0 and μ2 > 0, the optimal design under-samples subgroup 1. When both μ1 and μ2 are greater than zero, the optimal design has r1 < 0.5 and w1 < 0.5, reflecting the fact that it is advantageous to sample more subjects from subgroup 2 and allocate more type 1 error probability to the test of H02 since λ = 0.3 implies that P(Reject H02) has a greater weight than P(Reject H01) in the utility function.
In extreme cases where μ1 = 0, μ2 ≥ 0 and the prior variance is small, the optimal design has r1 = 0, so only subgroup 2 is sampled. When μ1 > 0, μ2 = 0 and the prior variance is small, the optimal design has r1 = 1 and only subgroup 1 is sampled.
In Figure S2, we show the effect of the prior correlation ρ on the design parameters when the prior SD is ν = 0.2. We observe that the correlation has an impact on the optimal weight w1 for testing the intersection hypothesis: in particular, when the treatment effects θ1 and θ2 have a high positive correlation, it is better to place most weight on one hypothesis rather than split the weight between the two hypotheses.
In Figures S3 and S4 we present further results, varying ρ in Figure S3 and λ in Figure S4. Since the utility to be maximized depends on the population prevalences, the optimal design parameters vary considerably with λ. We see from Figure S3 that ρ has only a small impact on the optimal value of r1 when adjusting for multiplicity and no impact at all in umbrella designs where no multiplicity adjustment is made. Figure S4 shows that the dependence of optimal design parameters on λ is similar to that seen in Figure 2: when the prior variance is large the optimal choices for r1 and w1 are close to λ, while for smaller variances the optimal designs depend on the prior means μ1 and μ2 as well as λ.

Optimal two-stage designs
The adaptation rules specify the second-stage design parameters a2 = (r1^(2), w1^(2)) that optimize the expected utility, as defined in Equation (1), given the first-stage statistics Z1^(1) and Z2^(1). The optimal r1^(2) and w1^(2) are calculated using the Hooke-Jeeves derivative-free minimization algorithm through the hjkb function in the dfoptim package 39 in R. 40 Figure 3 shows the optimized values of r1^(2) and w1^(2), together with the conditional expected utility when the trial proceeds using these optimized values and the increase in conditional expected utility compared to continuing with no adaptation. In each plot, the red circle indicates the 95% highest density region for the distribution of (Z1^(1), Z2^(1)) when the true treatment effects are θ1 = 0.3 and θ2 = 0, and the green ellipse indicates the 95% highest density region for the prior predictive distribution of (Z1^(1), Z2^(1)). The white regions contain values of (Z1^(1), Z2^(1)) for which the maximum conditional expected utility is below 0.01; in these cases the numerical optimization becomes unstable and optimal values of r1^(2) and w1^(2) are not shown. We also calculated the conditional expected utility if the trial continued with no adaptation, so r1^(2) = r1^(1) and w1^(2) = w1^(1), and the plots in the bottom row of Figure 3 show the gain in the conditional expected utility due to the optimized adaptation. In Section S3 of Appendix S1, we present optimal interim rules for further parameter values.
In Figure 4, we illustrate the procedure for optimizing the first-stage design parameters, a1 = (s^(1), r1^(1), w1^(1)) for an enrichment design or a1 = (s^(1), r1^(1)) for an umbrella design. For each combination of prior parameters and first-stage design parameters a1, we generated 1000 samples of first-stage data under treatment effects drawn from the prior distribution. For each first-stage dataset, we found the optimal second-stage design parameters and noted the conditional expected utility using these optimal parameters. We took the average of the 1000 values of the optimized conditional expected utility as our simulation-based estimate of the expected utility for this choice of a1. The optimal first-stage design parameters for a given prior distribution are those values of s^(1), r1^(1) and, in the case of an enrichment design, w1^(1), that yield the highest expected utility. Our results show the impact of the prior distribution on the optimized trial design parameters. The flat lines when s^(1) = 0.1 indicate that the expected utility is hardly affected by the choice of r1^(1) and w1^(1) when the interim analysis is performed early in the trial. When the interim analysis is performed later, the choice of first-stage design parameters is more important. It should be noted that for each pair of prior means (μ1, μ2), expected utility close to the overall optimum can be achieved using a wide range of first-stage design parameters, as long as the second-stage design is optimized given the first-stage data.
In Figures 5 and S10 we present optimized values of the first-stage design parameters, s^(1), r1^(1), and w1^(1), given that optimal values of the second-stage design parameters will be used following the interim analysis. The results are similar to those observed for optimal single-stage designs. The prior variance has a large impact on the optimal first-stage design: for smaller variances, interim analyses closer to the beginning of the trial yield a larger expected utility, while with larger variances, interim analyses after around 40% to 60% of the patients have been recruited are preferable. When the prior means are both 0, the optimal design parameters r1^(1) and w1^(1) are close to the subgroup 1 prevalence λ. However, if the prior suggests a benefit is more likely in subgroup 1, the optimal design over-samples this subgroup, increasing its trial prevalence and testing weight. Figure S10 shows that, for enrichment designs, the prior correlation has a large impact on the choice of w1^(1) but little effect on the optimal trial prevalences. As for single-stage designs, the optimal values of r1^(1) are similar for enrichment and umbrella designs. A notable difference is that while the prior correlation has no effect at all on the optimal values of r1 in a single-stage umbrella design, the optimal value of r1^(1) in a two-stage umbrella design does show a small dependence on ρ.

FIGURE 5  Optimized design parameters for two-stage designs and the expected utility, averaged over the prior. Parameters are a1 = (s^(1), r1^(1), w1^(1)) for enrichment trials and a1 = (s^(1), r1^(1)) for umbrella trials. Results are classified by μ1 and μ2, the prior means for θ1 and θ2, and by the prior SD ν = ν1 = ν2. The prior correlation between θ1 and θ2 is fixed at ρ = 0.5 and the population prevalence of subgroup 1 is assumed to be λ = 0.3
In the case of a single-stage umbrella design, the marginal distributions of θ̂1 and θ̂2 do not depend on ρ and thus, with no multiplicity adjustment in testing H01 and H02, the expected value of the utility defined in Equation (1) does not depend on ρ. However, in a two-stage umbrella trial, the optimal choice of r1^(2) and the resulting conditional expected utility depend on the posterior distribution of θ given θ̂^(1), which depends on ρ, and this in turn determines the optimal value of r1^(1).

It should be noted that the procedures we have described impose a high computational burden. While it is relatively straightforward to optimize the decision at the interim analysis, the overall optimization of the trial is performed using simulations over a grid of values for the first-stage design parameters. More rapid computation of the optimal values may be achieved by using approximations to the utility when extreme first-stage values are observed; for example, if both Z1^(1) and Z2^(1) are large and negative, the expected utility is practically zero for all choices of r1^(2) and w1^(2). In practice, one may wish to add the option of stopping the trial for futility if extreme negative results are observed at the interim analysis. The methods we have presented can be extended to find efficient designs that incorporate this option by working with a utility of the form

λ 1(Reject H01) + (1 − λ) 1(Reject H02) + k s^(2) n 1(Stop at the interim analysis),

assigning a positive value k to each observation saved by early stopping.

Performance of the Bayes optimal design under specific alternative hypotheses
In this section we consider adaptive designs optimized for a particular prior distribution for θ = (θ1, θ2) but we evaluate their performance under specific values of θ. We consider trials with a total sample size n = 700, response variance σ² = 1, and population prevalence of subgroup 1 equal to λ = 0.3. As a benchmark for comparison, we consider a nonoptimized, single-stage design with r1 = λ and w1 = 0.5. We derive and assess the performance of single-stage designs for which the design parameters r1 and w1 are optimized as described in Section 2.2, and we derive and assess two-stage designs for which the first-stage design parameters and the adaptation rule are optimized as described in Section 2.3. In optimizing designs, we assume the normal prior distribution for θ presented in Equation (2) with μ1 = 0.1 or 0.2, μ2 = 0, ν1 = ν2 = 0.2 and ρ = 0.5. These priors reflect the belief that a treatment benefit is more likely in subgroup 1. The prior SD of 0.2 corresponds to information from a trial with 100 subjects in each subgroup. We evaluate the operating characteristics of the designs for values of θ1 ranging from 0 to 0.3 and θ2 = 0 or 0.2. This creates scenarios with a treatment effect in only one subgroup, when θ2 = 0, or with a treatment effect in both subgroups, when θ2 = 0.2 and θ1 > 0. Figure 6 presents simulation results for enrichment trials and Figure S11 presents results for umbrella trials. The plots show the probabilities of rejecting H01 and H02 and the average utility at the end of the trial for a variety of combinations of μ1, μ2, θ1, and θ2. For the scenarios considered, we see that optimizing the trial for the assumed priors leads to a substantial increase in the power to reject H01 as compared to the nonoptimized, single-stage design. However, the optimized designs have lower power to reject H02 when θ2 = 0.2. The optimized designs have a higher average utility than the nonoptimized design when θ2 = 0.
FIGURE 7 Interim optimization. The color indicates the expected utility given the interim data for each combination of second-stage prevalence r₁⁽²⁾ for subgroup 1 and testing weight w₁⁽²⁾.

If θ₂ = 0.2, the two-stage design optimized for the prior with [...] than the nonoptimized design. These results are in line with previous studies, 41,42 which showed that adaptive enrichment designs provide the greatest advantage when a treatment effect is present in only one subgroup.
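Operating characteristics such as the rejection probabilities reported in Figure 6 can be approximated by simulation. The following sketch estimates the probability of rejecting H₀₁ and H₀₂ under given values of θ₁ and θ₂ for a single-stage design; the simple weighted Bonferroni split of α and all function names are our own illustrative assumptions, not the paper's closed testing procedure.

```python
import numpy as np
from statistics import NormalDist

def rejection_probs(theta1, theta2, n=700, r1=0.3, w1=0.5, alpha=0.05,
                    sigma=1.0, n_sim=200_000, seed=1):
    """Monte Carlo estimate of P(reject H01) and P(reject H02) for a
    single-stage design testing each subgroup with a weighted Bonferroni
    split of alpha (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    n1, n2 = r1 * n, (1 - r1) * n              # subjects per subgroup
    se1 = sigma * np.sqrt(4 / n1)              # SE of effect estimate, 1:1 split
    se2 = sigma * np.sqrt(4 / n2)
    z1 = rng.normal(theta1 / se1, 1.0, n_sim)  # standardized statistics
    z2 = rng.normal(theta2 / se2, 1.0, n_sim)
    c1 = NormalDist().inv_cdf(1 - w1 * alpha)  # critical value for H01
    c2 = NormalDist().inv_cdf(1 - (1 - w1) * alpha)
    return (z1 > c1).mean(), (z2 > c2).mean()
```

Under the null (θ₁ = θ₂ = 0) the estimated rejection probabilities are close to the allocated significance levels, while θ₁ > 0 yields the corresponding power for H₀₁.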

WORKED EXAMPLE: IMPLEMENTING AN OPTIMIZED ADAPTIVE ENRICHMENT TRIAL
Suppose we wish to compare an experimental treatment to a control in a phase III clinical trial. We intend to use adaptive sample allocation as there is reason to believe the new treatment may benefit only a subgroup of patients. This trial will have a normally distributed endpoint with variance σ² = 1 and, using information from a pilot study with 40 subjects from each subgroup, we construct a prior distribution π(θ) for the treatment effects θ.
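The mapping from pilot-study size to prior standard deviation can be made concrete. Assuming σ² = 1 and a 1:1 split between arms within each subgroup, the variance of the estimated treatment effect from m subjects per subgroup is 4σ²/m; this reproduces the earlier statement that 100 subjects per subgroup correspond to a prior SD of 0.2, and gives roughly 0.32 for the 40-subject pilot (a sketch; the function name is ours).

```python
from math import sqrt

def prior_sd(m_per_subgroup, sigma=1.0):
    """SD of the estimated treatment effect from a pilot study with
    m_per_subgroup subjects per subgroup, split 1:1 between arms:
    Var = sigma^2/(m/2) + sigma^2/(m/2) = 4*sigma^2/m."""
    return 2 * sigma / sqrt(m_per_subgroup)

sd_100 = prior_sd(100)  # 0.2, as quoted in the previous section
sd_40 = prior_sd(40)    # about 0.32, a wider prior from the smaller pilot
```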
The total sample size for the trial is planned to be n = 700 subjects. The population prevalence of subgroup 1 is λ = 0.3 and a FWER of α = 0.05 is to be used for the study. Under the above assumptions, the results in Figure 5 [...] The conditional error rates, calculated from Equations (5) and (6), are A₁ = 0.6140, A₂ = 0.0184, and A₁₂ = 0.3912. At this point, we optimize the second-stage design parameters r₁⁽²⁾ and w₁⁽²⁾. Figure 7 plots the conditional expected utility as a function of r₁⁽²⁾ and w₁⁽²⁾ on a color-coded scale. The maximum conditional expected utility, obtained using the Hooke-Jeeves algorithm, is at r₁⁽²⁾ = 0.314 and w₁⁽²⁾ = 0.953. We therefore conduct the second stage of the trial using these parameter values.
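The Hooke-Jeeves algorithm is a derivative-free pattern search, well suited to maximizing a simulated utility surface such as the one in Figure 7. Below is a minimal exploratory-move variant over the unit square; the quadratic `utility` surrogate, its peak location, and all names are hypothetical stand-ins for the actual conditional expected utility.

```python
def hooke_jeeves(f, x0, step=0.25, tol=1e-4, lo=0.0, hi=1.0):
    """Minimal Hooke-Jeeves-style pattern search maximizing f over
    [lo, hi]^d: try +/- step along each axis, keep improvements, and
    halve the step when no move improves the objective."""
    x, fx = list(x0), f(x0)
    while step > tol:
        improved = False
        for i in range(len(x)):            # exploratory moves along each axis
            for d in (step, -step):
                y = list(x)
                y[i] = min(hi, max(lo, y[i] + d))
                fy = f(y)
                if fy > fx:
                    x, fx, improved = y, fy, True
                    break
        if not improved:
            step /= 2                      # shrink the pattern when stuck
    return x, fx

# Hypothetical smooth surrogate for the conditional expected utility,
# peaked at r1 = 0.3, w1 = 0.9 (illustration only):
utility = lambda v: -(v[0] - 0.3) ** 2 - 0.5 * (v[1] - 0.9) ** 2
opt, val = hooke_jeeves(utility, [0.5, 0.5])   # converges close to (0.3, 0.9)
```

In practice, each evaluation of `f` would itself be a (possibly simulated) conditional expected utility, which is why a method tolerant of noisy, derivative-free objectives is attractive here.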
Suppose, after recruiting the remaining subjects, the second-stage estimates are [...]

EXTENDING THE DESIGNS
The methods we have described can be extended to trial designs with more than two stages or more than two subgroups. Suppose K disjoint subgroups S₁, …, S_K are specified and we wish to test the null hypotheses H₀ₖ: θₖ ≤ 0 against the alternatives H₁ₖ: θₖ > 0, where θₖ denotes the treatment effect in subgroup k. In a trial with J stages and a total sample size n, we recruit s⁽ʲ⁾n patients in stage j, where s⁽¹⁾ + ⋯ + s⁽ᴶ⁾ = 1, and within stage j we recruit rₖ⁽ʲ⁾s⁽ʲ⁾n patients from subgroup k, k = 1, …, K, where the prevalences rₖ⁽ʲ⁾ sum to 1 within each stage. Let Zₖ⁽ʲ⁾ denote the standardized statistic for testing H₀ₖ based on the stage-j data. In an enrichment design where control of the FWER is required, a suitable closed testing procedure is defined in terms of the Zₖ⁽ʲ⁾. Then, H₀ₖ is rejected globally at level α if all intersection hypotheses involving H₀ₖ are rejected in local, level-α tests.
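The closed testing logic generalizes directly to K hypotheses. The sketch below uses an unweighted Bonferroni test of each intersection for simplicity (the paper's procedure uses optimized weights); a hypothesis H₀ₖ is rejected at FWER level α only if every intersection hypothesis containing it is rejected locally.

```python
from itertools import combinations

def closed_test(pvals, alpha=0.05):
    """Closed testing procedure: H0k is rejected at FWER level alpha iff
    every intersection hypothesis containing H0k is rejected by a local
    level-alpha test (here an unweighted Bonferroni test, for
    illustration). Returns the set of rejected indices."""
    K = len(pvals)

    def intersection_rejected(I):
        # Bonferroni test of the intersection of {H0k : k in I}
        return min(pvals[k] for k in I) <= alpha / len(I)

    return {k for k in range(K)
            if all(intersection_rejected(I)
                   for m in range(1, K + 1)
                   for I in combinations(range(K), m)
                   if k in I)}
```

For example, with p-values (0.001, 0.2) only the first hypothesis survives the closure, since the elementary test of the second fails at level α.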
An adaptive design can be created by repeated application of the conditional error approach. An initial reference design is specified and, when adaptation occurs, the modified testing procedure is defined so as to preserve the conditional error rate of each individual and intersection hypothesis test under the updated design for the remainder of the trial. This updated design becomes the new reference design under which conditional error rates are calculated at any subsequent adaptation point.
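For a single null hypothesis, the conditional error rate has a simple closed form under a standard inverse-normal combination test; the paper applies the same idea to each individual and intersection hypothesis. The function below is a sketch of that standard construction (the combination rule and parameter names are generic, not the paper's exact procedure).

```python
from math import sqrt
from statistics import NormalDist

def conditional_error(z1, t, alpha=0.025):
    """Conditional error rate of a two-stage inverse-normal combination
    z-test that rejects when sqrt(t)*Z1 + sqrt(1-t)*Z2 > c, after
    observing the first-stage statistic z1 at information fraction t."""
    N = NormalDist()
    c = N.inv_cdf(1 - alpha)          # critical value of the reference test
    # P(reject | Z1 = z1), with Z2 ~ N(0, 1) under the null hypothesis
    return 1 - N.cdf((c - sqrt(t) * z1) / sqrt(1 - t))
```

Any modified second stage preserves the type 1 error rate provided it tests its null hypothesis at local level A = `conditional_error(z1, t)`, for example by rejecting when the new second-stage statistic exceeds Φ⁻¹(1 − A).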
We can consider optimizing the choice of the design parameters s⁽ʲ⁾ and rₖ⁽ʲ⁾ or the weights in the tests of intersection hypotheses. The generalization of our earlier approach requires a prior distribution for the treatment effects θ = (θ₁, …, θ_K) and a utility function whose expectation is to be maximized. If λₖ is the population prevalence of subgroup k, k = 1, …, K, a natural extension of Equation (1) is [...] In Section 2.3.3 we applied backwards induction to find the optimal design for a trial with two subgroups and two stages. Since the dimension of the state space grows with the number of subgroups and stages, such a direct application of backwards induction may not be feasible more generally. Other methods of optimization can be employed to find efficient, if not globally optimal, designs. For example, in a multistage design one may construct the adaptation rule at each interim analysis assuming the trial will continue without any further adaptation. We note that the optimization process is liable to be computationally intensive, and sufficient computing resources should be committed so that candidate designs can be assessed in a timely manner.
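The prior-averaged expected utility that such an optimizer would maximize can itself be approximated by Monte Carlo. The sketch below uses K subgroups with separate level-α tests (as in an umbrella design) and a generic utility of the form Σₖ λₖθₖ·1{reject H₀ₖ}; this form and all names are only plausible stand-ins for the paper's Equation (1) and its extension.

```python
import numpy as np
from statistics import NormalDist

def expected_utility(prev, r, n, prior_mean, prior_sd, alpha=0.025,
                     sigma=1.0, n_sim=50_000, seed=7):
    """Monte Carlo estimate of the prior-averaged utility
    E[sum_k prev[k] * theta_k * 1{reject H0k}] for K subgroups under
    separate level-alpha z-tests (no multiplicity adjustment)."""
    rng = np.random.default_rng(seed)
    prev, r = np.asarray(prev, float), np.asarray(r, float)
    K = len(prev)
    theta = rng.normal(prior_mean, prior_sd, (n_sim, K))  # draws from the prior
    se = sigma * np.sqrt(4 / (r * n))                     # 1:1 split within subgroup
    z = rng.normal(theta / se, 1.0)                       # test statistics given theta
    crit = NormalDist().inv_cdf(1 - alpha)
    return float((prev * theta * (z > crit)).mean(axis=0).sum())
```

Wrapping this estimator in an optimizer over the trial prevalences `r` (and any other free parameters) gives a direct, if computationally intensive, route to efficient designs when full backwards induction is infeasible.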

DISCUSSION
We have presented a Bayesian decision-theoretic framework in which a clinical trial design can be optimized when two disjoint subgroups are under investigation. Our approach has both Bayesian and frequentist elements: the rules for hypothesis testing control the type 1 error rate, and Bayesian decision tools are used to choose the design parameters within this scheme. This allows optimization of the sampling prevalence of each subgroup and the weights in a weighted Bonferroni test of the intersection hypothesis, as well as optimal adaptation of these design parameters at the interim analysis. The optimal design maximizes the expected value of the specified utility function, averaged over the prior distribution assumed for the treatment effects in the two subgroups. After focusing on two-stage trials with two subgroups in Sections 2 and 4, we outlined how our optimization framework may be extended to allow more subgroups or stages in the trial in Section 5.
Our results provide insights into how the mean and variance of the prior distribution affect the optimal timing of the interim analysis and the trial prevalences for each subgroup of patients. In practice, it is advisable to consider the sensitivity of the design's efficiency to these modeling assumptions in order to create a trial design with robust efficiency.
In contrast to adaptive enrichment designs where recruitment is either from the full patient population or restricted to a single subgroup, we propose sampling from each subgroup at a specific rate which may differ from its population prevalence. We acknowledge that achieving the optimized prevalences in a trial may be challenging: additional screening will be required and over-sampling a particular subgroup may delay a trial compared to an all-comers design. 43,44 If logistical considerations imply that each subgroup is either dropped or sampled according to its population prevalence, our framework can still be used to optimize the other design parameters.
In Section 3.2 we discussed designs with the option of early stopping for futility and how the utility function might be modified to facilitate optimizing such designs. A similar approach could be followed to relax the requirement of a fixed total sample size and allow re-assessment of future sample size at an interim analysis.
We have defined methods for normally distributed observations and a normal prior for the treatment effects. While this has allowed us to demonstrate how to construct such designs, it is not a necessary restriction. With normally distributed responses, one could allow a separate response variance for each patient subgroup, placing prior distributions on these variances. In trials with other types of response distribution, including survival or categorical endpoints, standardized test statistics will still be approximately normally distributed if sample sizes are large enough, although nonnormal prior distributions may be appropriate. 45 We assumed the null hypotheses of interest are that there is no treatment effect in each subgroup. Our decision-theoretic framework can accommodate other formulations, such as testing for treatment effects in the full population and in one particular subgroup, 8,20,[22][23][24][46][47][48] in which case the stage-wise test statistics for different subgroups are correlated. Care is required to ensure that enrichment designs control the FWER when test statistics are correlated, but this is not an issue in umbrella trials with separate level-α tests for each null hypothesis. 31

Although we have focused on hypothesis testing, estimating treatment effects after an adaptive trial is also important. 49 Simultaneous or marginal confidence regions for parameters, with or without multiplicity adjustment, can be constructed following a two-stage design. 50,51 Point estimates may be obtained as a weighted average of the treatment effects observed in the first and second stages, 11,52 but, due to the sample size adaptations and subgroup selection, these estimators may be biased, with the bias depending on the specific adaptation rules and the true parameter values. A thorough investigation of estimation for adaptive enrichment designs will be a topic of future research.
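The weighted-average point estimate mentioned above can be sketched as follows; the weighting scheme and names are illustrative. With a prespecified weight w (e.g. the planned first-stage information fraction), the combination is unbiased when the second stage does not depend on the first-stage data; under data-driven adaptation it can be biased, as noted above.

```python
from math import sqrt

def two_stage_estimate(est1, se1, est2, se2, w=None):
    """Fixed-weight combination of stage-wise treatment effect estimates
    est1, est2 with standard errors se1, se2. If w is not given, an
    inverse-variance weight is computed from the *planned* standard
    errors; data-driven weights would reintroduce bias."""
    if w is None:
        w = se2 ** 2 / (se1 ** 2 + se2 ** 2)
    est = w * est1 + (1 - w) * est2
    se = sqrt(w ** 2 * se1 ** 2 + (1 - w) ** 2 * se2 ** 2)
    return est, se
```

For equal planned standard errors the weight reduces to 1/2, so the combined estimate is the simple average of the two stage-wise estimates.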
Software in the form of an R package is available at https://github.com/nicoballarini/OptimalTrial.