Design and estimation in clinical trials with subpopulation selection

Population heterogeneity is frequently observed among patients' treatment responses in clinical trials because of various factors such as clinical background, environmental, and genetic factors. Different subpopulations defined by those baseline factors can lead to differences in the benefit or safety profile of a therapeutic intervention. Ignoring heterogeneity between subpopulations can substantially impact on medical practice. One approach to address heterogeneity necessitates designs and analysis of clinical trials with subpopulation selection. Several types of designs have been proposed for different circumstances. In this work, we discuss a class of designs that allow selection of a predefined subgroup. Using the selection based on the maximum test statistics as the worst‐case scenario, we then investigate the precision and accuracy of the maximum likelihood estimator at the end of the study via simulations. We find that the required sample size is chiefly determined by the subgroup prevalence and show in simulations that the maximum likelihood estimator for these designs can be substantially biased.


INTRODUCTION
Heterogeneity is frequently observed among patients' treatment response in clinical trials. This is due to various factors such as age, race, disease severity, or genetic differences. The topic of heterogeneity in treatment effects has received some attention in the literature (eg, see related works [1][2][3] ) and graphical methods such as forest plots are routinely used for the purpose of examining heterogeneity in effects (eg, the work of Cuzick 4 ). Ignoring heterogeneity can substantially impact on medical practice. For example, a treatment might work well in some patients but not in others. Naively estimating the treatment effect across all patients will result in a diluted effect for the group that truly benefits from the treatment. At the same time, an ethical issue arises due to delivering a treatment to all patients, whereas some might not expect an effect and will potentially be exposed to harmful side effects. To address these issues, trials that consider (potential) subgroups defined by one or more biomarkers are becoming more popular. In general, a biomarker is some measurable variable that might help to identify distinct groups of patients and some examples include cholesterol levels, genetic variations, or age. A biomarker is considered prognostic if it provides information about the value of some other variable of interest (eg, the primary endpoint of a study), whereas it is called predictive if its value yields information about the treatment effect. In this paper, we will only consider the latter type of biomarkers.
A number of different designs concerning treatment selection and subgroups within the study populations have been proposed. These designs can be categorized by factors such as design setting (confirmatory or exploratory) or methodology (frequentist, Bayesian, or utility/decision function) (see related works [5][6][7]). Additionally, the designs can be categorized into single-stage (fixed sample) designs and multistage (adaptive) designs. Both conventionally utilize multiple testing procedures to test for effects in each of the populations of interest. An overview of different multiple testing approaches for this purpose is given in the work of Alosh et al 6 and the references therein. A single-stage design with one biomarker tests, for example, two null hypotheses: that the treatment effect in the full population is zero ($H_{0F}$) and that the treatment effect in the subgroup of interest is zero ($H_{0S}$). 5,[8][9][10][11][12] These designs are usually employed for exploratory subgroup analysis in phase II (ie, to identify an interesting subgroup) or for confirmatory subgroup analysis in phase III, examining the treatment benefit of prespecified subgroups. Corresponding multistage designs are constructed either as extensions of group sequential approaches 13 or using combination tests. 14 They can refine the population to either the whole population or one or more subgroups at the interim analysis and can allow for early stopping for benefit and lack of benefit (see, eg, other works 5,[15][16][17][18]). The accuracy and precision of the treatment effect estimators in subgroup analysis are also crucial to the development of novel treatments and decisions about treatment implementation. In particular, bias is ubiquitous in designs that select (see the work of Bauer et al 19 ) and, in the designs considered here, the bias can come from selecting which (sub)population should be studied further or from selectively reporting promising results, even in a simple fixed sample design.
A variety of papers on treatment effect estimation in the related problem of trials with treatment selection have been published. Approximate bias-corrected estimators for single-stage designs with normal endpoints are discussed in the works of Shen 20 and Stallard et al, 21 uniformly minimum variance conditional unbiased estimators for two-stage designs have been proposed by Cohen and Sackrowitz, 22 and further extensions are published in the works of Bowden and Glimm 23 and Sill and Sampson. 24 Shrinkage estimators have been discussed in the work of Carreras and Brannath, 25 whereas approaches to construct confidence intervals are described in related works. [26][27][28] Time-to-event endpoints are considered in the work of Brückner et al. 29 In contrast, rather limited literature addresses estimation issues in clinical trials with subpopulation selection. For single-stage designs, Rosenkranz 30 proposed a bias-adjustment method employing bootstrap techniques to calibrate the estimates under general distributional assumptions on the outcomes. For multistage designs, Kimani et al 31 proposed two estimators: a naive estimator using a weighted average of per-stage means and prevalences for each subgroup, and a uniformly minimum variance conditional unbiased estimator derived via the Rao-Blackwell theorem. They assessed the performance under several situations, such as different values of prevalence and treatment effect of one subpopulation, and also suggested which estimator should be used depending on which population is selected at Stage 1. In addition, Magnusson and Turnbull 16 focused on the designs rather than estimation, though they outlined an extension of the bias-reduction algorithm proposed by Wang and Leung, 32 which uses double bootstrap methods 33 to adjust ML estimates and build bootstrap confidence intervals.
Despite some contributions on estimation, the aforementioned papers do not provide a complete overview of the maximum likelihood estimator (MLE) under various designs and do not explore the estimator's performance under a wider range of conditions. Rosenkranz's 30 simulation work on single-stage designs implicitly regarded the MLE only in circumstances with a few different treatment effects for subgroups and a few thresholds used in the selection rule. Kimani et al 31 considered two-stage adaptive seamless designs, selecting a subpopulation based on the Stage 1 data but not allowing early stopping, and they only assessed estimators under selection, not under selective reporting of promising results. The multistage designs of Magnusson and Turnbull 16 allow multiple subpopulations to be selected if the estimates of their treatment effects are above certain thresholds at Stage 1.
In this paper, we discuss a framework for single-stage and multistage designs that select subgroups. We illustrate the design properties when selection is based on the maximum statistic and comprehensively evaluate the properties of the MLE for these designs. Note that selecting on the basis of the maximum statistic is the worst case for both type I error (provided that the number of hypotheses remains the same) and bias, and is hence of particular interest. In Section 2, we derive a subgroup selection design that selects groups based on the maximum test statistic. Section 3 describes a simulation study in which different general design scenarios are evaluated and the bias and MSE of the corresponding MLEs are derived. In Section 4, we remark on designs with different selection rules, then summarize the results of the simulation study and discuss its implications for future work.

DESIGNS
In this section, we first define the basic setting and notation and then provide general ideas for designs with subpopulation selection based on the maximum test statistic.

Basic setting and notation
Assume that the full study population ($F$) consists of $J$ mutually disjoint subpopulations and denote the prevalence of the $j$th subpopulation ($S_j$) by $\pi_j$, where $j = 1, \ldots, J$ and $\sum_{j=1}^{J} \pi_j = 1$. The sample size of each subgroup is fixed as a proportion of the total sample size depending on the respective prevalence. We use $n_j$ to denote the sample size in subgroup $S_j$ and, more generally, use subscripts to denote groups and treatments and superscripts for stages. We consider a normally distributed endpoint with mean $\mu_{j,l}$, with $j = 1, \ldots, J$ and $l = T, C$, where subscript $T$ corresponds to the treatment group and $C$ to the control group. Additionally, we assume a common variance, ie, $\sigma^2$, across subpopulations.

Single-stage design
For a single-stage design, the test statistics used for selection and decision are distributed as
$$Z_j^{(1)} = \left(\bar{Y}_{j,T}^{(1)} - \bar{Y}_{j,C}^{(1)}\right) I_j^{(1)} \sim N\left(\theta_j I_j^{(1)},\, 1\right).$$
Note that we use the (unnecessary) superscript (1) for consistency with the multistage notation used later. $\bar{Y}_{j,T}^{(1)}$ and $\bar{Y}_{j,C}^{(1)}$ are the sample means of the treatment group and of the control group within $S_j$, respectively. The true treatment difference in $S_j$ is denoted as $\theta_j = \mu_{j,T} - \mu_{j,C}$, and $I_j^{(1)} = 1/\left(\sigma\sqrt{1/n_{j,T}^{(1)} + 1/n_{j,C}^{(1)}}\right)$ is the information level for $S_j$. This further simplifies to $1/\left(2\sigma\sqrt{1/n_j^{(1)}}\right)$ when the assumed treatment allocation ratio is 1:1, where $n_j^{(1)}$ is the total sample size of $S_j$ until the end of Stage 1.
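Considering a composite population $S_{\mathcal{A}+\mathcal{B}}$, combining two subpopulations $S_{\mathcal{A}}$ and $S_{\mathcal{B}}$ (where $\mathcal{A}, \mathcal{B} \subseteq \{1, 2, \ldots, J\}$ and $\mathcal{A} \cap \mathcal{B} = \emptyset$), the test statistics are distributed as
$$Z_{\mathcal{A}+\mathcal{B}}^{(1)} = \left(\bar{Y}_{\mathcal{A}+\mathcal{B},T}^{(1)} - \bar{Y}_{\mathcal{A}+\mathcal{B},C}^{(1)}\right) I_{\mathcal{A}+\mathcal{B}}^{(1)} \sim N\left(\theta_{\mathcal{A}+\mathcal{B}} I_{\mathcal{A}+\mathcal{B}}^{(1)},\, 1\right),$$
where $\bar{Y}_{\mathcal{A}+\mathcal{B},T}^{(1)}$ and $\bar{Y}_{\mathcal{A}+\mathcal{B},C}^{(1)}$ are defined as before, but the observations are from the combined treatment group and the combined control group of the united subpopulation $S_{\mathcal{A}+\mathcal{B}}$. The true treatment effect size and the information level of $S_{\mathcal{A}+\mathcal{B}}$ are $\theta_{\mathcal{A}+\mathcal{B}} = \mu_{\mathcal{A}+\mathcal{B},T} - \mu_{\mathcal{A}+\mathcal{B},C}$ and $I_{\mathcal{A}+\mathcal{B}}^{(1)} = 1/\left(2\sigma\sqrt{1/n_{\mathcal{A}+\mathcal{B}}^{(1)}}\right)$ for equal allocation. Additionally, $\theta_{\mathcal{A}+\mathcal{B}} = (\pi_{\mathcal{A}}\theta_{\mathcal{A}} + \pi_{\mathcal{B}}\theta_{\mathcal{B}})/(\pi_{\mathcal{A}} + \pi_{\mathcal{B}})$. Note that, if $\mathcal{A}$ and $\mathcal{B}$ are complementary, their composite population $S_{\mathcal{A}+\mathcal{B}}$ is the full population $F$, and the subscripts of the aforementioned notation are then replaced with $f$. If $\mathcal{A}$ and $\mathcal{B}$ each contain a single element, such as $\{1\}$ and $\{2\}$, we simplify the notation $\mathcal{A}+\mathcal{B}$ to $1+2$. This notation simply denotes the union of $S_{\mathcal{A}}$ and $S_{\mathcal{B}}$, and it does not necessarily imply that one is nested in the other.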
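The quantities defined above can be computed directly. The following is a minimal Python sketch, assuming the prevalence symbols $\pi_j$ and the common standard deviation $\sigma$ as used in this section; the function names (`information_level`, `composite_effect`, `z_statistic`) and all numerical inputs are hypothetical and chosen for illustration only.

```python
from math import sqrt

def information_level(n_treat, n_control, sigma):
    """Information level 1 / (sigma * sqrt(1/n_T + 1/n_C)) of one (sub)population."""
    return 1.0 / (sigma * sqrt(1.0 / n_treat + 1.0 / n_control))

def composite_effect(prevalences, effects):
    """Prevalence-weighted treatment effect of a union of disjoint subpopulations."""
    return sum(p * t for p, t in zip(prevalences, effects)) / sum(prevalences)

def z_statistic(mean_treat, mean_control, n_treat, n_control, sigma):
    """Observed mean difference multiplied by the information level."""
    diff = mean_treat - mean_control
    return diff * information_level(n_treat, n_control, sigma)

# Illustrative values (not taken from the paper): 100 patients in S_1, 1:1 allocation.
info_s1 = information_level(50, 50, sigma=1.0)
shortcut = 1.0 / (2.0 * 1.0 * sqrt(1.0 / 100))      # equal-allocation simplification
theta_full = composite_effect([0.5, 0.5], [0.5, 0.0])  # F as the union of S_1 and S_2
```

With equal allocation, `info_s1` and `shortcut` coincide, which is the simplification stated for the 1:1 case.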

Multistage design
For multistage designs, the test statistic based on the accumulated data at the end of stage $k$ ($k \le K$, the total number of stages) for a population $S_s$ is denoted by $Z_s^{1:k} = \left(\bar{Y}_{s,T}^{1:k} - \bar{Y}_{s,C}^{1:k}\right) I_s^{1:k}$, where the superscript $1{:}k$ refers to a quantity calculated based on the accumulated data at the end of stage $k$; therefore, $I_s^{1:k}$ is the accumulated information level, defined accordingly as $1/\left(\sigma\sqrt{1/n_{s,T}^{1:k} + 1/n_{s,C}^{1:k}}\right)$.

Designs considered
We consider designs that control the family-wise error rate (FWER) at level $\alpha$ in the strong sense 34 for the set of hypotheses to be tested, $\{H_{0s},\ s \in \mathcal{S}\}$, where $\mathcal{S}$ is the index set corresponding to the subpopulations considered and can index nested groups. For instance, if subgroup 1, the union of subgroups 1 and 2, and the full population are of interest, $\mathcal{S} = \{1, 1+2, f\}$.

Single-stage designs
For population selection, we use the maximum of the test statistics $Z_s^{(1)}$, $s \in \mathcal{S}$. Its implications and other selection rules will be discussed later. In the evaluation of the operating characteristics, we consider the case where population selection is undertaken first and only subsequently is the corresponding hypothesis tested. The testing procedure rejects $H_{0w}$ if $Z_w^{(1)} \ge C_\alpha$, where $w$ is a realized value of the random variable $W$ and refers to the event that subpopulation $S_w$ is chosen. $Z_w^{(1)}$ is the selected test statistic for $S_w$, and $C_\alpha$ is the corresponding critical value found to ensure control of the FWER in the strong sense.
The crucial element for finding the appropriate critical value and sample size is the joint density of the selected test statistic $Z_W^{(1)}$ and the selected population index $W$. While the subsequent results are derived on the basis of selecting based on the maximum statistic, other selection rules can equally be implemented. Using a different rule results in a different density and, for illustration purposes, we also provide the resulting distribution for selecting any populations whose estimated effect exceeds a prespecified value in the Supplementary Materials S.6. The joint densities $p_{Z_W^{(1)},W}(z_w^{(1)}, w; \Theta)$, $w \in \mathcal{S}$, govern the probability of selecting $S_w$ and rejecting the null hypothesis $H_{0w}$ (where $\Theta$ is a configuration of all mutually disjoint subgroup treatment effects $\theta_1, \theta_2, \ldots, \theta_J$). Each joint density can further be decomposed as
$$p_{Z_W^{(1)},W}(z_w^{(1)}, w; \Theta) = p_{Z_w^{(1)}}(z_w^{(1)}; \Theta)\, \Pr(W = w \mid Z_w^{(1)} = z_w^{(1)}; \Theta).$$
Consequently, the joint densities of $Z_W^{(1)}$ and $W$ can be represented as
$$p_{Z_W^{(1)},W}(z_w^{(1)}, w; \Theta) = \phi\left(z_w^{(1)} - \theta_w I_w^{(1)}\right) \Psi_{\setminus w}\left(z_w^{(1)}, \ldots, z_w^{(1)}; \Theta\right), \qquad (1)$$
where $\phi$ denotes the standard normal density and $\Psi_{\setminus w}(\cdot, \ldots, \cdot; \Theta)$ is the cumulative distribution function of the $(|\mathcal{S}| - 1)$-dimensional normal distribution conditional on $Z_w^{(1)}$ under a specified configuration of treatment effects $\Theta$, where $|\mathcal{S}|$ is the cardinality of $\mathcal{S}$. The covariance matrix depends on whether subgroups are nested or not (see examples in Supplementary Materials S.2 and S.3). The cumulative distribution function specifies $\Pr(W = w \mid Z_w^{(1)} = z_w^{(1)}; \Theta)$. Note that (1) is similar to the integrand of equation (4) in the work of Spiessens and Debois, 9 where two coprimary analyses are performed on the full population and a subgroup, and the significance level for $F$ is prespecified.
Using an iterative search, $C_\alpha$ can then be found using the following inequality:
$$\sum_{w \in \mathcal{S}} \int_{C_\alpha}^{\infty} p_{Z_W^{(1)},W}(z_w^{(1)}, w; \Theta_0)\, dz_w^{(1)} \le \alpha, \qquad (2)$$
where $\Theta = \Theta_0$ denotes the global null hypothesis $H_0$: $\theta_1 = \theta_2 = \ldots = \theta_J = 0$. Note that finding the critical value under this setting implies weak control of the FWER. Following the work of Magirr et al, 35 it can be shown, however, that weak control implies strong control since $\theta_1 = \theta_2 = \ldots = \theta_J = 0$ maximizes the type I error when selection is based on the maximum. Similarly, assume that the alternative hypothesis under which exactly one subgroup (say $S_w$, $w \in \mathcal{S}$) has a nonzero positive effect size, ie, $\theta_w = \delta > 0$, and the others have none is true. The required total sample size for the full population, $n^{(1)}$, can then be found using the aforementioned critical value, the desired effect $\delta$, and a specified power level, ie, $1 - \beta$. The related inequality is
$$\int_{C_\alpha}^{\infty} p_{Z_W^{(1)},W}(z_w^{(1)}, w; \Theta_a)\, dz_w^{(1)} \ge 1 - \beta, \qquad (3)$$
where $\Theta_a$ denotes the alternative hypothesis, a vector of size $J$ whose elements are all 0 except for the $w$th element, which is $\delta$. The desired $n^{(1)}$ is obtained by iteratively increasing the sample size until Equation (3) holds. Note that only rejection of the hypothesis with the truly largest effect is considered in this power requirement. Similar considerations can be used to find the power to reject any false null hypothesis (see Figure 1 for an example).
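The two searches just described can be sketched for the simplest case of choosing between $S_1$ and $F$ when $F$ consists of two disjoint subgroups. The sketch below is not the authors' implementation. It assumes 1:1 allocation and subgroup sizes $n_j = \pi_j n$, under which $Z_f = \sqrt{\pi_1} Z_1 + \sqrt{\pi_2} Z_2$ and the conditional probability in (1) reduces to a univariate normal probability. The function names, the Simpson integration, and the bisection are illustrative choices.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def joint_density(z, w, theta1, theta2, pi1, n, sigma=1.0):
    """Density of (Z_W, W) at (z, w) when max(Z_1, Z_f) selects S_1 ("1") or F ("f")."""
    pi2 = 1.0 - pi1
    info_f = sqrt(n) / (2.0 * sigma)                   # information level of F
    info1, info2 = sqrt(pi1) * info_f, sqrt(pi2) * info_f
    theta_f = pi1 * theta1 + pi2 * theta2
    rho = sqrt(pi1)                                    # correlation of Z_1 and Z_f
    if w == "1":
        # Z_f <= z given Z_1 = z  <=>  Z_2 <= z * (1 - rho) / sqrt(pi2)
        return STD_NORMAL.pdf(z - theta1 * info1) * STD_NORMAL.cdf(
            z * (1.0 - rho) / sqrt(pi2) - theta2 * info2)
    # Z_1 given Z_f = z is normal with variance 1 - rho^2 = pi2
    cond_mean = theta1 * info1 + rho * (z - theta_f * info_f)
    return STD_NORMAL.pdf(z - theta_f * info_f) * STD_NORMAL.cdf(
        (z - cond_mean) / sqrt(pi2))

def tail_prob(c, w, theta1, theta2, pi1, n, sigma=1.0, steps=1000):
    """Pr(W = w, Z_w >= c) by Simpson's rule on a truncated range."""
    args = (theta1, theta2, pi1, n, sigma)
    shift = max(abs(theta1), abs(theta2)) * sqrt(n) / (2.0 * sigma)
    upper = max(c, shift) + 8.0                        # negligible mass beyond this
    h = (upper - c) / steps
    total = joint_density(c, w, *args) + joint_density(upper, w, *args)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * joint_density(c + i * h, w, *args)
    return total * h / 3.0

def critical_value(pi1, alpha=0.025):
    """Bisection for the common critical value under the global null, as in (2)."""
    lo, hi = 0.0, 5.0
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        fwer = sum(tail_prob(mid, w, 0.0, 0.0, pi1, n=100) for w in ("1", "f"))
        lo, hi = (mid, hi) if fwer > alpha else (lo, mid)
    return hi

def sample_size(delta, pi1, sigma=1.0, alpha=0.025, power=0.8):
    """Increase the total n until S_1 is selected and H_01 rejected with the target power, as in (3)."""
    crit = critical_value(pi1, alpha)
    n = 4
    while tail_prob(crit, "1", delta, 0.0, pi1, n, sigma) < power:
        n += 2
    return n, crit

if __name__ == "__main__":
    print(sample_size(delta=0.5, pi1=0.5))
```

Under the global null the sample size does not affect the joint density, which is why `critical_value` passes an arbitrary `n`. For more than two candidate populations the conditional probability becomes a multivariate normal integral and this closed form no longer applies.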
We have derived the aforementioned formula here for consistency, as for the multistage designs considered in the following, only the selected subgroup continues to subsequent stages.

Multistage designs
The multistage designs we consider follow similar procedures to the aforementioned single-stage designs. Population selection is performed at the first interim analysis, and any population in $\mathcal{S}$ can be selected. We consider the case where data after Stage 1 are enriched so that the total sample size in the trial remains fixed, but the sample size of subgroups that have not been selected is reallocated to the remaining populations. Suppose the selected population is $S_w$. The difference is that, at stage $k$, the testing procedure stops by rejecting $H_{0w}$ if $Z_w^{1:k} \ge C_{k,\alpha}^{u}$, stops without rejecting $H_{0w}$ if $Z_w^{1:k} \le C_k^{l}$, and otherwise continues to stage $k + 1$, where $C_{k,\alpha}^{u}$ and $C_k^{l}$ are the corresponding upper and lower stopping boundaries at stage $k$.
Two elements are required to obtain appropriate stopping boundaries and stage-wise sample sizes. The first is the joint density of $(Z_W^{(1)}, W)$, as shown in (1). The second element is the density of the conditional distribution of the test statistic $Z_w^{1:k}$ (with accumulated data until stage $k$) given its precursor $Z_w^{1:(k-1)}$ at stage $k - 1$. We denote this conditional density by $p_{w,k|k-1}(z_w^{1:k} \mid z_w^{1:(k-1)}; \Theta)$, and its general mathematical form is given in Supplementary Materials S.4. The stage-wise density comprising the two elements can then be used to determine the probability of stopping for efficacy or for futility at stage $k$. For example, the stage-wise densities at Stage 2 with different values of $W$ are specified as
$$p_{Z_W^{(1)},W}(z_w^{(1)}, w; \Theta)\, p_{w,2|1}(z_w^{1:2} \mid z_w^{(1)}; \Theta), \quad w \in \mathcal{S}.$$
Then, given $\Theta = \Theta_0$ (ie, under the global null hypothesis), the probability of early stopping at Stage 2 for the subgroup $S_w$ (shown here for early rejection; stopping for lack of effect is obtained by integrating $z_w^{1:2}$ over $(-\infty, C_2^{l}]$ instead) can be calculated as
$$\int_{C_1^{l}}^{C_{1,\alpha}^{u}} \int_{C_{2,\alpha}^{u}}^{\infty} p_{Z_W^{(1)},W}(z_w^{(1)}, w; \Theta_0)\, p_{w,2|1}(z_w^{1:2} \mid z_w^{(1)}; \Theta_0)\, dz_w^{1:2}\, dz_w^{(1)}, \qquad (4)$$
where the integral bounds signify that the design continues after Stage 1 but stops at Stage 2 for efficacy. The conditional density $p_{w,2|1}(z_w^{1:2} \mid z_w^{(1)}; \Theta)$ is used to calculate the stopping probability at Stage 2, given that the design does not stop at the preceding stage. Similarly, the stage-wise densities at stage $k$ are the product of the expression in (1) and the factors $p_{w,m|m-1}(z_w^{1:m} \mid z_w^{1:(m-1)}; \Theta)$, $m = 2, \ldots, k$. The value of the $k$-fold multiple integral over the integration region defined by the stopping boundaries before stage $k + 1$ is the early stopping probability at stage $k$. Each conditional density $p_{w,m|m-1}(z_w^{1:m} \mid z_w^{1:(m-1)}; \Theta)$, with its respective integral bounds, controls the probability of whether the design stops or continues, given that the design has proceeded at the previous stage.
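To find boundaries that ensure FWER control, an iterative search over the stopping boundaries is conducted based on the following inequality:
$$\sum_{k=1}^{K} \sum_{w \in \mathcal{S}} \int_{A_k} p_{Z_W^{(1)},W}(z_w^{(1)}, w; \Theta_0) \prod_{m=2}^{k} p_{w,m|m-1}(z_w^{1:m} \mid z_w^{1:(m-1)}; \Theta_0)\, dz_w^{1:k} \cdots dz_w^{(1)} \le \alpha, \qquad (5)$$
where the integration region is $A_k = (C_1^{l}, C_{1,\alpha}^{u}) \times \cdots \times (C_{k-1}^{l}, C_{k-1,\alpha}^{u}) \times [C_{k,\alpha}^{u}, \infty)$, the product is empty for $k = 1$, and $\Theta_0$ denotes the global null hypothesis. We define $z_w^{1:0} = z_w^{1:1}$ and therefore $p_{w,1|1}(z_w^{1:1} \mid z_w^{1:1}; \Theta) = 1$. Note that this yields only one inequality, whereas $C_1^{l}, \ldots, C_K^{l}$ and $C_{1,\alpha}^{u}, \ldots, C_{K,\alpha}^{u}$ are all unknown. To overcome this, we set them to follow a specific functional form, where $C_K^{l} = C_{K,\alpha}^{u}$ for the $K$-stage design. For example, when using O'Brien-Fleming (OBF) 13,36 type stopping boundaries, $C_{k,\alpha}^{u} = C_{OBF}(K, \alpha)\sqrt{K/k}$ and $C_k^{l}$ is a certain function of $k$. In addition, the calculations in (5) assume that the futility bounds are binding. For nonbinding bounds, one can simply set the lower bounds to $-\infty$.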
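The functional form of the boundaries can be written down in a few lines. The sketch below assumes OBF-type upper bounds as given above, a constant lower bound before the final stage (zero is used later in the simulation study), and a lower bound equal to the upper bound at the final stage. The constant passed as `c_obf` is a placeholder: in the design it has to come from the search over (5). The function names are hypothetical.

```python
from math import sqrt, inf

def obf_upper_bounds(c_obf, n_stages):
    """Upper bounds C_OBF * sqrt(K / k) for k = 1, ..., K."""
    return [c_obf * sqrt(n_stages / k) for k in range(1, n_stages + 1)]

def lower_bounds(upper, futility=0.0, binding=True):
    """Constant futility bound before the final stage (-inf if nonbinding);
    at the final stage the lower bound equals the upper bound."""
    early = [futility if binding else -inf] * (len(upper) - 1)
    return early + [upper[-1]]

def stage_decision(z, k, upper, lower):
    """Decision at stage k (1-based) for the accumulated statistic z."""
    if z >= upper[k - 1]:
        return "reject"
    if z <= lower[k - 1]:
        return "stop without rejecting"
    return "continue"

# Placeholder constant, for illustration only (not a calibrated value).
upper = obf_upper_bounds(c_obf=2.0, n_stages=2)
lower = lower_bounds(upper)
```

Because the final-stage bounds coincide, `stage_decision` never returns "continue" at stage $K$.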
As before, (5) implies weak control of the FWER but also guarantees strong control following the arguments in the work of Magirr et al. 35 Suppose that an alternative hypothesis of the form $\theta_w = \delta > 0$ for exactly one element (say $w$) in $\mathcal{S}$ and $\theta_{w^*} = 0$ for all $w^* \ne w$ is true. Then, under this alternative hypothesis, the aforementioned critical values, and the specified power, the stage-wise total sample size for the full population, $n^{(k)}$, can be found to satisfy the following inequality:
$$\sum_{k=1}^{K} \int_{A_k} p_{Z_W^{(1)},W}(z_w^{(1)}, w; \Theta_a) \prod_{m=2}^{k} p_{w,m|m-1}(z_w^{1:m} \mid z_w^{1:(m-1)}; \Theta_a)\, dz_w^{1:k} \cdots dz_w^{(1)} \ge 1 - \beta, \qquad (6)$$
where the configuration $\Theta_a$ has a nonzero positive effect in the $w$th element but the other $J - 1$ elements are zero. Detailed derivations of (5) and (6) are provided in Supplementary Materials S.1, and the design details of two-stage designs with two subgroups (considering selection of $S_1$ or $F$) are given in Supplementary Materials S.5.

An illustrative example
The Dose Ranging Efficacy And safety with Mepolizumab in severe asthma (DREAM) trial 37 investigates, among other endpoints, the effect of mepolizumab on exacerbations and forced expiratory volume in 1 second (FEV 1 ). Subsequent secondary analyses of the trial data 38,39 find that the treatment effect of mepolizumab depends on the baseline level of eosinophils and suggest that only patients with blood eosinophil levels of more than 150 cells per μL benefit from the treatment. Suppose that, on the basis of these exploratory findings, we wish to embark on a prospective evaluation of the claim that mepolizumab results in meaningful improvements only for patients with baseline eosinophil levels of 150 or more cells per μL in the blood. We will use the change in FEV 1 from baseline to 90 days, modeled as normally distributed, as the primary endpoint, although the same arguments hold for other endpoints such as exacerbations. Additionally, we suppose that the prevalence of each group (below and above 150 cells per μL of blood) is 50%. Following the work of Santanello et al, 40 we assume that the standard deviation is 0.72 L and consider a reduction of FEV 1 of 0.23 L as the minimum clinically relevant treatment difference, and consequently seek to power our evaluations for this effect.
Three different evaluation strategies are considered, ie, (i) running two separate studies, one in each of the two subgroups, (ii) a single-stage study with one subgroup versus the full population (see Section 2.2.1), and (iii) a two-stage enrichment design where the best performing group is selected at the halfway point and early stopping using O'Brien and Fleming bounds 36 is used (see Section 2.2.2). For each of the three designs, we consider a type I error per study of 2.5% and require a power of 80% to reject any false null hypothesis. Furthermore, we assume that 25 patients are recruited per month and that it takes two months to conduct the interim analysis for strategy (iii).
A summary of the characteristics of the different strategies is given in Table 1. The strategy using two separate studies requires just over 600 patients to be recruited, whereas the single-stage design with two groups needs almost 70 more patients. The reason for this is that no attempt has been made in the first approach to control the FWER. If we were to correct for multiplicity for the separate studies using a Bonferroni correction, the required sample size would increase to 748 patients. Using a two-stage selection design allows us to reduce the required sample size even further to around 550 patients, a reduction of 10% and 30% as compared to the uncorrected and multiplicity-corrected separate study strategies, respectively. Additionally, the two-stage design investigates more patients in the group that is truly benefitting from treatment, which is one of the reasons for the reduction in required sample size. Besides the reduction in sample size, running a single study rather than two separate ones also yields organizational advantages. The main drawback of this approach is that the duration of the study is increased by almost nine months should the subgroup be selected (although a small reduction in the duration is expected if the full population is selected at interim).
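The figures for the separate-studies strategy can be checked with the textbook sample size formula for a one-sided two-sample z-test with 1:1 allocation, $n_{\text{arm}} = 2\sigma^2 (z_{1-\alpha} + z_{1-\beta})^2 / \delta^2$. The paper does not state which formula was used, so the sketch below is an assumption; under it the totals come out at 616 patients without correction and 748 with the Bonferroni correction, in line with the numbers quoted above. The function names are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def per_arm_sample_size(delta, sigma, alpha, power):
    """Patients per arm for a one-sided two-sample z-test with equal allocation."""
    z = NormalDist().inv_cdf
    return ceil(2.0 * sigma ** 2 * (z(1.0 - alpha) + z(power)) ** 2 / delta ** 2)

def total_for_separate_studies(delta, sigma, alpha, power, n_studies=2):
    """Total patients over n_studies independent two-arm studies."""
    return n_studies * 2 * per_arm_sample_size(delta, sigma, alpha, power)

# Inputs from the illustrative example: sigma = 0.72 L, delta = 0.23 L, 80% power.
uncorrected = total_for_separate_studies(0.23, 0.72, alpha=0.025, power=0.80)
bonferroni = total_for_separate_studies(0.23, 0.72, alpha=0.025 / 2, power=0.80)
```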
Note that, in addition to the advantages illustrated earlier, the FWER in the two-stage enrichment design is controlled for the worst-case situation in terms of selection and hence other selection rules can be used without error rate inflation.

Alternative designs
We have illustrated how to obtain critical bounds and sample size for general enrichment designs earlier. Here, we discuss alternative designs considering different type-I error and power configurations.

Significance levels and stopping boundaries.
An alternative to specifying the design and corresponding stage-wise levels via the boundaries is to assign a marginal significance level $\alpha_k$ to each stage $k$ (where $\sum_k \alpha_k = \alpha$) and use an error spending approach as used in classic group sequential designs. 13 Such considerations affect the way we find stopping boundaries where the same boundaries are shared by all the populations considered. More specifically, based on the following inequality (7), the critical value used in $A_{k-1}$ is first searched for under the upper limit $\alpha_{k-1}$ (where the subscript of the upper bounds is changed accordingly). Those critical values are then substituted for the associated bounds used in $A_k$ under the upper limit $\alpha_k$ to find the remaining critical values, and so on:
$$\sum_{w \in \mathcal{S}} \int_{A_k} p_{Z_W^{(1)},W}(z_w^{(1)}, w; \Theta_0) \prod_{m=2}^{k} p_{w,m|m-1}(z_w^{1:m} \mid z_w^{1:(m-1)}; \Theta_0)\, dz_w^{1:k} \cdots dz_w^{(1)} \le \alpha_k, \quad k = 1, \ldots, K. \qquad (7)$$
Note that there are several ways to determine the lower stopping boundaries; for example, one could set values symmetric to the upper critical values, or simply set them to 0. One can further prespecify the marginal significance levels for $|\mathcal{S}| - 1$ specific populations at each stage. One example of this can be found in the work of Spiessens and Debois, 9 although they only consider single-stage designs. Such design features may lead to different stopping boundaries for the populations included in $\mathcal{S}$.
Incidentally, for two-stage designs, if early stopping is not considered at Stage 1 (that is, the Stage 1 data are only used for population selection), then the first interval of the integration region $A_k$ in Equations (5) and (6) is $(-\infty, \infty)$ for $k > 1$. Meanwhile, the upper bound $C_{1,\alpha}^{u}$ of $A_1$ is defined as $\infty$ and therefore the Stage 1 integral over $z_w^{(1)}$ is 0. Such designs are the same as the two-stage adaptive seamless designs used in the work of Kimani et al. 31
Power.
The power of the designs in Section 2.2 is defined as the probability of detecting the treatment effect in the population of interest under $H_a$. Alternatively, we can define power as the probability of detecting a treatment effect in any of a set of specific subpopulations. Such a change leads to a different total sample size for $F$ because of its influence on Equation (6), which is the basis for searching $n^{(k)}$. The inequality then becomes
$$\sum_{w \in \mathcal{S}^*} \sum_{k=1}^{K} \int_{A_k} p_{Z_W^{(1)},W}(z_w^{(1)}, w; \Theta_a) \prod_{m=2}^{k} p_{w,m|m-1}(z_w^{1:m} \mid z_w^{1:(m-1)}; \Theta_a)\, dz_w^{1:k} \cdots dz_w^{(1)} \ge 1 - \beta,$$
where $\mathcal{S}^*$ is a subset of $\mathcal{S}$ and contains the specified subpopulations of interest. As an example, for $\mathcal{S} = \{1, f\}$ and $\mathcal{S}^* = \mathcal{S}$, Figure 1 shows the resulting total sample sizes $n^{(1)}$ in a single-stage design, corresponding to different prevalence values of $S_1$, under different definitions of power. The left panel is computed to have power $1 - \beta$ for selecting the subpopulation with the largest true effect and rejecting the corresponding null hypothesis, whereas the right panel considers any correct rejection. Under the left power definition, the required sample size is large when the prevalence of the subgroup with a positive treatment effect is small, as the number of patients having said effect is (relatively) small. As the prevalence $\pi_1$ approaches 1, $n^{(1)}$ increases again as the effect of the subgroup dominates the effect in the full population and differentiating between the two populations becomes more difficult: since the effect sizes for $S_1$ and $F$ are close, it is difficult to select the correct subgroup and thus large sample sizes are needed. In contrast, $n^{(1)}$ always decreases under the definition of power to detect $\theta_1 > 0$ or $\theta_f > 0$. The reason that $n^{(1)}$ always decreases for larger prevalences in the right panel is that there is no restriction on selecting a prespecified population and reporting the efficacy.
The decreasing pattern is similar to that obtained with the closed testing procedure 41 in a single-stage design, where the total sample is available for investigating any subpopulation without considering selection. Note that all the patterns observed in Figure 1 also emerge for multistage designs (not shown in this paper).
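The two power definitions can be compared by simulation for a single-stage design that chooses between $S_1$ and $F$. The following Monte Carlo sketch is illustrative only. It assumes two disjoint subgroups, 1:1 allocation, an effect $\delta$ in $S_1$ and none in $S_2$ (so that both $H_{01}$ and $H_{0f}$ are false), and a critical value `crit` supplied by the user. The function name and the inputs in the usage line are hypothetical, not values from the paper.

```python
import random
from math import sqrt

def simulate_power(n, pi1, delta, crit, sigma=1.0, n_sim=100_000, seed=1):
    """Return (power to select S_1 and reject H_01, power for any correct rejection)."""
    rng = random.Random(seed)
    pi2 = 1.0 - pi1
    info1 = sqrt(pi1 * n) / (2.0 * sigma)     # information level of S_1
    select_s1_and_reject = 0
    any_rejection = 0
    for _ in range(n_sim):
        z1 = rng.gauss(delta * info1, 1.0)    # effect delta in S_1
        z2 = rng.gauss(0.0, 1.0)              # no effect in S_2
        zf = sqrt(pi1) * z1 + sqrt(pi2) * z2  # full-population statistic
        if max(z1, zf) >= crit:               # the selected statistic is the maximum
            any_rejection += 1
            if z1 >= zf:
                select_s1_and_reject += 1
    return select_s1_and_reject / n_sim, any_rejection / n_sim

if __name__ == "__main__":
    print(simulate_power(n=308, pi1=0.5, delta=0.5, crit=2.19))
```

The second probability is never smaller than the first, which is why the sample size required under the "any correct rejection" definition is the smaller of the two.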

ESTIMATION ASSESSMENT
In this section, we report a simulation study assessing the properties of MLEs. Note that, in the reported figures, different scales for the y-axes are used to highlight patterns.

Simulation setup
In our evaluations, we specify the FWER, ie, $\alpha$, as 0.025 and set the sample size for each scenario so that the power of the design is $1 - \beta = 80\%$. Our alternative hypothesis is that the treatment has an effect of 0.5 in $S_1$, whereas the effect of the treatment is zero for all other subgroups. Therefore, the power aims to detect the nonzero effect in $S_1$ (that is, to reject $H_{01}$) once the first subgroup is selected. The assumed common variance across subpopulations, ie, $\sigma^2$, is set to 1 and we use 1 000 000 simulation runs. The designs we consider are a single-stage design with two subpopulations (Design 1), a single-stage design with three subpopulations (Design 2), and two-stage designs with two subpopulations and three subpopulations (Design 3 and Design 4, respectively), for which an OBF upper stopping boundary and a fixed lower boundary of zero are used. We calculate the stopping boundaries and the total sample sizes for $F$ based on (2) and (3) for single-stage designs (and (5) and (6) for multistage designs). The sample sizes and critical values for each of the designs are given in Appendix A (Tables A1-A2). Based on these four designs, several scenarios are investigated, altering design features such as the prevalence.
Denote by $\hat{\theta}$ the naive MLE (that is, not accounting for selection) of the parameter $\theta$; then $\hat{\theta}_f$ and $\hat{\theta}_s$ represent the MLEs for the treatment effects of $F$ and $S_s$, respectively. The estimates are calculated as the difference in sample means, $\hat{\theta}_s = \bar{Y}_{s,T} - \bar{Y}_{s,C}$, and we use the bias, $E(\hat{\theta}_s) - \theta_s$, and the mean squared error (MSE), $E[(\hat{\theta}_s - \theta_s)^2]$, as performance measures for estimation assessment. As the sample size for the full population satisfies the aforementioned power requirement and varies across different prevalences, a standardized scale is used in the assessments (readers are referred to Supplementary Materials S.7 for details on the standardization). In our subsequent evaluations, we will consider three situations. Firstly, we consider the treatment effect estimator regardless of the population being selected or the hypothesis test being significant. Secondly, we consider only the estimators of the selected populations, which is expected to result in selection bias. The third situation considers reporting bias and, for this, we consider only the treatment effect estimates of the selected population if the corresponding hypothesis test is significant. Implicitly, we are therefore considering that the outcome of a study is only reported (published) if it was significant. Note that, in the evaluations to follow, we refer to the selection bias as Select $S_w$ and the reporting bias as Select $S_w$ + Reject $H_{0w}$, where $w \in \mathcal{S}$ specifies the population chosen through a selection rule. In addition to the bias and MSE depending on which subgroup has been selected, we also report the family-wise (FW) bias and MSE, ie, the bias and MSE averaged over all possible selections.
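The selection and reporting situations described above can be mimicked with a small simulation for a Design 1 type setting (choice between $S_1$ and $F$). This is a rough sketch rather than the simulation code used in the paper. It assumes 1:1 allocation and, because Supplementary Materials S.7 is not reproduced here, it standardizes by expressing the estimation error in units of the standard error of the selected estimator, which may differ from the authors' standardization. The function name, the critical value, and the sample size in the usage line are hypothetical.

```python
import random
from math import sqrt

def simulate_design1(theta1, theta2, pi1, n, crit, sigma=1.0, n_sim=100_000, seed=1):
    """Bias, root MSE (both in standard-error units) and proportion of runs,
    keyed by (selected population, situation)."""
    rng = random.Random(seed)
    pi2 = 1.0 - pi1
    info_f = sqrt(n) / (2.0 * sigma)
    info = {"1": sqrt(pi1) * info_f, "f": info_f}
    theta = {"1": theta1, "f": pi1 * theta1 + pi2 * theta2}
    errors = {(w, s): [] for w in info for s in ("select", "select+reject")}
    for _ in range(n_sim):
        z1 = rng.gauss(theta1 * info["1"], 1.0)
        z2 = rng.gauss(theta2 * sqrt(pi2) * info_f, 1.0)
        z = {"1": z1, "f": sqrt(pi1) * z1 + sqrt(pi2) * z2}
        w = "1" if z["1"] >= z["f"] else "f"      # maximum-statistic selection
        err = z[w] - theta[w] * info[w]           # (MLE - truth) / standard error
        errors[(w, "select")].append(err)
        if z[w] >= crit:                          # result reported only if significant
            errors[(w, "select+reject")].append(err)
    summary = {}
    for key, errs in errors.items():
        if errs:
            bias = sum(errs) / len(errs)
            rmse = sqrt(sum(e * e for e in errs) / len(errs))
            summary[key] = (bias, rmse, len(errs) / n_sim)
    return summary

if __name__ == "__main__":
    for key, value in simulate_design1(0.0, 0.0, pi1=0.5, n=100, crit=2.19).items():
        print(key, value)
```

Here the MLE of the selected population is recovered from its test statistic as $\hat{\theta}_w = Z_w / I_w$, so the standardized error is simply $Z_w - \theta_w I_w$.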

Scenarios for Design 1
Scenarios here cover different prevalence values of $S_1$, with $\pi_1$ varying from 0.05 to 0.95 in increments of 0.05. We illustrate the assessments for the scenarios under three configurations of different values of $\theta_1$ and $\theta_2$ in Figures 2-4. Their horizontal axes show the prevalence of $S_1$, $\pi_1$, and the vertical axes of the row-wise panels show standardized bias, standardized $\sqrt{\text{MSE}}$, and simulation proportions (%). Figure 2 presents the estimation assessment of $\hat{\theta}_f$ and $\hat{\theta}_1$ under the assumption of $\theta_1 = 0$ and $\theta_2 = 0$. As expected, we do not see any bias when no selection is undertaken, and the standardized MSE is constant, ie, a pattern that is repeated throughout all other simulations. Additionally, the selection probability is constant at 50% due to the equal effect in both subgroups. The selection bias is largest when the prevalence in the subgroup is smallest, with a matching pattern for the standardized MSE. The reporting bias and MSE follow the same pattern, although at a markedly increased level. Figure 3 considers the case when $\theta_1 = 0.5$ and $\theta_2 = 0$. Considering the selection probabilities first, we find that, as per design, there is an 80% chance to select population 1 correctly and reject the corresponding hypothesis. The selection probability of the full population increases as the prevalence increases, because the effect in the full population gets larger as the subpopulation contributes more toward it. At the same time, the chance of rejecting the corresponding hypothesis also increases. The selection and reporting bias in the full population estimate is largest when the prevalence in the subpopulation is smallest and then steadily decreases toward zero. The size of the bias is well over 0.5 standard errors for almost all prevalences and hence should be considered important, although the incorrect selection in itself is not very common in this case. For the full population, the bias dominates the MSE and hence the MSE follows the same pattern.
Focusing on subpopulation 1, we find that bias is present, although it is of much smaller magnitude (selection bias at most 0.1 and reporting bias at most 0.35 standard errors) than for the full population (up to over 2 standard errors). The selection bias is maximized at a prevalence of around 0.75, whereas the reporting bias is largest for small prevalences.
When both subgroups have the same effect, $\theta_1 = \theta_2 = 0.5$ (Figure 4), we observe that the full population is almost always selected; only for large prevalences of the subpopulation (> 50%) do we obtain a notable selection probability for the subpopulation (up to 20%). As a consequence, we obtain no estimate of the bias and MSE for the subpopulation at low prevalences. The bias of the estimate in this population is potentially very large (> 3 standard errors) but drops quickly toward zero as the prevalence increases. In this setting, it is also notable that the selection bias is virtually identical to the reporting bias, as very large observed effects are necessary to select the subpopulation in the first place.
The patterns for the full population are somewhat more distinct as no bias is observed for small prevalences because it is always the full population that is selected. The bias in this case is, however, very small even in the worst-case situation (prevalence of around 0.75), where the reporting bias is less than 0.1 standard errors and the selection bias is even smaller.
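The qualitative behavior described for the null configuration (a 50% selection probability and a selection bias that is largest for small prevalences) can be reproduced with a small Monte Carlo sketch. The sketch assumes normally distributed subgroup estimates with known variance, subgroup sample sizes proportional to the prevalence, and selection of $S_1$ whenever its standardized statistic exceeds that of $F$. The function name `simulate_design1` and the information parameter `info` are hypothetical, and the paper's actual sample sizes (derived from the power requirement) are not used here.

```python
import math
import random


def simulate_design1(prevalence, effect_s1, effect_s2, info=100.0,
                     n_sim=100_000, seed=1):
    """Monte Carlo sketch: choose S1 or F by the larger test statistic.

    Returns (proportion of trials selecting S1,
             standardized selection bias of the S1 estimate, or None).
    """
    rng = random.Random(seed)
    se_s1 = 1.0 / math.sqrt(prevalence * info)          # SE of the S1 estimate
    se_s2 = 1.0 / math.sqrt((1.0 - prevalence) * info)  # SE of the complement estimate
    se_full = 1.0 / math.sqrt(info)                     # SE of the full-population estimate
    n_selected = 0
    error_sum = 0.0
    for _ in range(n_sim):
        est_s1 = rng.gauss(effect_s1, se_s1)
        est_s2 = rng.gauss(effect_s2, se_s2)
        # Full-population estimate as the prevalence-weighted average.
        est_full = prevalence * est_s1 + (1.0 - prevalence) * est_s2
        if est_s1 / se_s1 > est_full / se_full:  # maximum-statistic selection
            n_selected += 1
            error_sum += est_s1 - effect_s1
    if n_selected == 0:
        return 0.0, None
    return n_selected / n_sim, (error_sum / n_selected) / se_s1


for prevalence in (0.05, 0.5, 0.95):
    prob, bias = simulate_design1(prevalence, 0.0, 0.0)
    print(f"prevalence={prevalence:.2f}  P(select S1)={prob:.3f}  "
          f"standardized selection bias={bias:.3f}")
```

With equal effects in both subgroups, the difference between the two statistics is symmetric around zero, so the proportion selecting $S_1$ should be close to 50% at every prevalence.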

Scenarios for Design 2
Scenarios for Design 2 concern the selection of a population among $S_1$, $S_{1+2}$, and $F$ under different configurations of $\theta_1$, $\theta_2$, and $\theta_3$. Our focus here is to assess the MLEs $\hat{\theta}_1$, $\hat{\theta}_{1+2}$, and $\hat{\theta}$ under $\theta_1 = 0.5$, $\theta_2 = 0$, $\theta_3 = 0$, using a population selection rule that is one variant of the maximum statistic rule and decides sequentially which population is selected. The results for other configurations of $\theta_1$, $\theta_2$, and $\theta_3$ are provided in the supplementary tables beginning with Table S.1. The results in Table 2 show that, in this case, the correct population is selected most of the time (> 80%) due to the design constraint to obtain 80% power. The selection bias when selecting the correct population is small at < 0.1 standard errors, and even the reporting bias is only modest at 0.27 standard errors. The selection and reporting biases when selecting an incorrect population are notably larger in this instance, reaching up to 1.3 standard errors. The bias is largest for the full population because the true underlying effect in this group, 0.167, is the smallest among all populations and hence a rather unusual sample is required for its MLE to be the largest.
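The true effects of the nested populations follow from prevalence-weighted averaging of the subgroup effects. The short sketch below assumes three subgroups with equal prevalence of one third, which reproduces the value 0.167 quoted for the full population; the function name is hypothetical.

```python
def nested_population_effects(subgroup_effects, prevalences):
    """Effects of the nested populations S1, S1+2, ..., F.

    Each nested population's effect is the prevalence-weighted average
    of the effects of the subgroups it contains.
    """
    effects = []
    weighted_sum = 0.0
    total_prevalence = 0.0
    for effect, prevalence in zip(subgroup_effects, prevalences):
        weighted_sum += effect * prevalence
        total_prevalence += prevalence
        effects.append(weighted_sum / total_prevalence)
    return effects


effects = nested_population_effects([0.5, 0.0, 0.0], [1 / 3, 1 / 3, 1 / 3])
print([round(e, 3) for e in effects])  # [0.5, 0.25, 0.167]
```

The effect is diluted from 0.5 in $S_1$ to 0.25 in $S_{1+2}$ and 0.167 in $F$, so the MLE of the full population can only be the largest in a rather atypical sample.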

Scenarios for Design 3
The investigation presented here concerns Design 3, a two-stage design. We focus on $\theta_1 = 0.5$ and $\theta_2 = 0$, whereas the results for other configurations are given in Figures S.1-S.6 of Supplementary Materials S.8. Figure 5 shows the results for the estimator of the full population. The top row corresponds to the standardized bias, the middle row to the standardized $\sqrt{\mathrm{MSE}}$, and the bottom row to the probability of selecting the full population. The first column is associated with trials that stop at Stage 1, the second considers only trials that reach Stage 2, and the final column corresponds to the estimator irrespective of when the trial was stopped. In addition to the selection bias and the reporting bias, the figure also shows the bias of the estimator irrespective of the reason for stopping (green triangles).
The reporting bias is potentially very large (up to three standard errors for Stage 1 only and up to two standard errors for Stage 2); it is largest when the prevalence of the subgroup is small and decreases as the prevalence increases. When only considering studies that select the full population and stop at Stage 1, it approaches zero, whereas the bias in fact becomes negative for trials that stop at the second stage. The overall estimator is, however, always positively biased, showing a pattern very similar to that of the Stage 1 cases. The selection bias, both overall and for Stage 2 only, follows the same pattern as the reporting bias, whereas for Stage 1 only it shows an inverted U-shape that is maximized at a prevalence of around 0.5. The bias of the estimator that only considers stopping at Stage 1 for any reason follows the same pattern as the selection bias, although the bias is smaller. It is noteworthy that, although substantial bias is exhibited in some situations, these situations (eg, selecting the full population and stopping at Stage 1) occur only rarely. The standardized $\sqrt{\mathrm{MSE}}$ behaves like the standardized bias except at the second stage. There, the MSE (for selection, reporting, and regardless of selection) decreases at a different rate before inflating substantially at a prevalence of 0.8.
Considering the findings for the estimator of the first subpopulation, ie, $\hat{\theta}_1$ (Figure 6), the results exhibit patterns similar to those in Figure 5 in many circumstances. When stopping the trial at the first stage, the estimator is substantially biased for prevalences up to 0.6. The reporting bias subsequently decreases from two standard errors, whereas the selection bias is more moderate at around 1 SE. The MSE (regardless of the circumstances) decreases from 2 SE to 0.9 SE and is close to one for prevalences larger than 0.7. As the subpopulation is selected correctly most of the time, the selection bias and the bias considering all studies that stopped at Stage 1 are very similar, and the MSE is near one standard error. The estimators considering only trials that stop at Stage 2 are almost unbiased for small and moderate prevalences but can exhibit a large negative bias when the prevalence is large. The MSE is close to 1 SE for most prevalences but becomes very large beyond a prevalence of 0.7. The overall estimator is, however, positively biased (for both selection and reporting) for all prevalences and shows an inverted U-shape with a maximum bias of about 0.3 SEs at a prevalence of 0.6. For prevalences below 0.7, its MSE conditional on selection or without selection differs from the MSE that accounts for reporting. Beyond that, the MSE is similar in all cases, with a shallow U-shape below 1 SE.
The FW bias and MSE for this design with $\theta_1 = 0.5$ and $\theta_2 = 0$ are given in Figure 7.

Scenarios for Design 4
Design 4 is the two-stage counterpart of Design 2 for selecting a population among $S_1$, $S_{1+2}$, and $F$ under different configurations of $\theta_1$, $\theta_2$, and $\theta_3$. The investigation here focuses on assessing the maximum likelihood estimators $\hat{\theta}_1$, $\hat{\theta}_{1+2}$, and $\hat{\theta}$ under $\theta_1 = 0.5$, $\theta_2 = 0$, $\theta_3 = 0$, using the same stopping boundaries and sample sizes ($n^{(1)} = 335$) found for Design 4 with the maximum statistic selection rule, the configuration of treatment effects ($\theta_1 = 0.5$, $\theta_2 = 0$, $\theta_3 = 0$), and subgroup prevalences of 1/3. Table 3 shows the results of the estimators for the first subgroup, the combined subgroup, and the full population. The standardized bias, standardized $\sqrt{\mathrm{MSE}}$, and simulation proportions are presented for trials that stop at Stage 1, for trials that reach Stage 2, and irrespective of the stopping stage.
Considering the trials irrespective of the stopping stage, we observe that the correct population is selected in 80% of simulations due to the design requirement of 80% power. The bias is positive for all the overall estimators and varies widely (from 0.05 up to 1.2 standard errors). The selection and reporting biases are smallest when selecting the correct population (less than 0.3 standard errors) and larger when selecting an incorrect population (particularly the full population). All the standardized MSEs are larger than one standard error but only up to a moderate size of around 1.3. When the correct population is selected or the null hypothesis is rejected, the estimator for the first subgroup has a smaller standardized MSE (around 1.06 standard errors) than its counterparts.
The results at the individual stages show a different picture. More trials stop at Stage 2 than at Stage 1 and, at each stage, the correct population accounts for the largest proportion of selections (around 30% and 50% at Stage 1 and Stage 2, respectively). The bias is large at Stage 1. The selection and reporting biases are smaller when selecting $S_1$ (around 1.1 standard errors) than when selecting $S_{1+2}$ or $F$ (around 1.8 and 2, respectively). A moderate bias is observed at Stage 2 (up to 0.85 standard errors). In particular, the selection and reporting biases are negative for the estimator of the first subgroup. The standardized MSEs of all the estimators at Stage 1 are much larger than one SE, whereas those at Stage 2 show the opposite pattern, being less than 1 (between 0.8 and 1).

DISCUSSIONS AND CONCLUDING REMARKS
In this paper, we have discussed general design considerations for clinical trials with subpopulation selection and illustrated how such studies can be designed. The design framework described can be viewed as an extension of group-sequential methods 42 and therefore requires the same types of assumptions; specifically, we assume an independent increment structure of the data. In our evaluations, we have assumed that the primary endpoint is available immediately or at least before the next patient is recruited to the trial. While the general results in the paper continue to hold if the endpoint is available only after some time, patients may still be recruited from a subpopulation that is subsequently not selected. Different approaches to deal with delayed responses have been proposed in the context of group-sequential trials (eg, the work of Hampson and Jennison 43 ). As a general rule, however, it is clear that the efficiency of selection is reduced if the time to observe the endpoint is long in comparison with the recruitment speed. Other assumptions made within this framework are common to most adaptive designs. Most notably, we are assuming that there are no differences in the population before and after the interim analysis and, in particular, that no time trends are present.
In this work, we only consider designs with normally distributed endpoints, although they can easily be extended to other types of endpoints via the efficient scores framework. 42,44 Note, however, that particular care is required when using time to event endpoints (see the work of Magirr et al 45 for a more detailed discussion of the challenges of adaptive trials with time to event endpoints). Moreover, we assume that the subgroup prevalence is known, although specifying this parameter correctly in the design will clearly be crucial for the design's operating characteristics. A consequence of the assumed known prevalence is that we only present the estimation assessment of the MLE where subgroup sample sizes are fixed according to the respective prevalence in the designs. Further simulations (not shown), however, suggest that random sample sizes of the populations alter the findings only marginally.
Selection based on the maximum test statistic is the main focus throughout the paper, and an R package implementing this design is currently under development. While this selection rule is simple and intuitive, it may not be optimal in certain circumstances. It makes sense to adopt the rule when some subgroup treatment effects have been identified as being positive and the differences between the test statistics across subgroups are reasonably large. However, when the test statistics for $S_s$ and $F$ are close but the former is larger, applying this rule raises an ethical issue: only part of the population is selected rather than the whole population, although the whole population could benefit from the treatment. Therefore, other selection rules should be considered and investigated for such situations.
One alternative, which is also considered for designs with treatment selection (eg, see the work of Bretz et al 46 ), is to introduce a threshold in the selection rule. This allows all the subgroups whose effect sizes are similar to the best one (their absolute difference is within a threshold) to be united so that the pooled population can continue to the next stage. It also permits selecting a single population whose effect size exceeds the effect sizes of the others by more than the threshold.
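One possible reading of such a threshold rule is sketched below in Python. The function name, the dictionary input, and the choice to compare observed effect estimates (rather than test statistics) are assumptions made for illustration; the rule studied in the cited work may be defined differently.

```python
def select_with_threshold(subgroup_estimates, threshold):
    """Return the subgroups to be pooled and carried forward.

    A subgroup is retained if its observed effect is within `threshold`
    of the best observed effect. With threshold = 0 this reduces to
    picking the single best subgroup (ties aside).
    """
    best = max(subgroup_estimates.values())
    return sorted(name for name, estimate in subgroup_estimates.items()
                  if best - estimate <= threshold)


interim = {"S1": 0.50, "S2": 0.45, "S3": 0.10}
print(select_with_threshold(interim, threshold=0.1))  # ['S1', 'S2']
print(select_with_threshold(interim, threshold=0.0))  # ['S1']
```

A larger threshold makes pooling more likely and therefore moves the rule toward always continuing with the full population, whereas a threshold of zero recovers a maximum-based selection.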
Another option that has been used in the context of treatment selection (eg, see the works of Magnusson and Turnbull 16 and Magirr and Jaki 35 ) is simply to select a population whose efficacy exceeds a certain value at Stage 1. This selection rule was used in the work of Magnusson and Turnbull 16 and integrates population selection and hypothesis testing at the first stage. Their designs, which consider a prior ordering of the underlying effect sizes of all individual subgroups, are related to ours, in which the target subpopulations for selection have a nested structure. Note that the mathematical expression of $p_{Z^{(1)}_W, W}(\cdot, \cdot)$ in (1) will be different if the aforementioned selection rules are used. We provide the required modifications to the design framework in the supplementary materials for illustrative purposes.
In terms of estimation, we have assessed the bias of the MLE under various scenarios. We find that the bias is almost always positive, leading to an overenthusiastic estimate of the true treatment effect. While the size of the bias can be viewed as negligible in some settings, it can become large in others. The challenge is clearly that one will usually not know whether one is in one of these extreme situations. Another observation is that, although bias is introduced by selecting the population, it increases markedly (often more than doubling) when only significant results are reported. This highlights the effect of reporting bias, which may be even more problematic than the bias introduced by selection.
Our results suggest that the MSE of the overall MLEs performs quite well (around 1 standard error) in many circumstances and scenarios. We find that whether or not the correct population is selected affects the size of the MSE of the corresponding estimator. The effect can be more substantial when, in addition, only significant results are reported. The same finding is observed even in the extreme scenario in which no correct population is defined because the underlying effect of each subgroup is assumed to be zero.
Future work will consider estimators that are unbiased (or have smaller bias) while maintaining comparable MSE. The conditional bias-adjusted estimator following the ideas in the work of Stallard and Todd 28 appears to be the most promising. An extension to multistage designs, conditional on the trial continuing to the final stage, can be achieved naturally. However, whether the resulting estimators have a smaller MSE should be verified in further investigations.

A.3 Design 3
For this single-stage design with three subgroups, the prevalence of each subgroup is equal to one third, resulting in a critical value of c = 2.289 and a total sample size of N = 575.
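As a rough plausibility check of this critical value, the family-wise type I error under the global null can be approximated by simulation. The sketch below assumes that the three nested populations $S_1$, $S_{1+2}$, and $F$ are each tested with a standardized statistic, that the null is rejected when the maximum exceeds c, and that subgroup sample sizes are proportional to the prevalences. These modeling choices and the function name are assumptions for illustration, not a description of the software used for the paper.

```python
import math
import random


def familywise_error(critical_value, prevalences, n_sim=200_000, seed=2024):
    """Estimate P(max nested Z-statistic > critical_value) under the global null."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sim):
        cumulative_score = 0.0
        cumulative_prevalence = 0.0
        max_z = -math.inf
        for prevalence in prevalences:
            # Independent subgroup score with variance proportional to its prevalence.
            cumulative_score += rng.gauss(0.0, math.sqrt(prevalence))
            cumulative_prevalence += prevalence
            # Standardized statistic of the nested population built so far.
            max_z = max(max_z, cumulative_score / math.sqrt(cumulative_prevalence))
        if max_z > critical_value:
            rejections += 1
    return rejections / n_sim


print(familywise_error(2.289, [1 / 3, 1 / 3, 1 / 3]))
```

If the design targets a one-sided level of 2.5% (the level is not stated in this section), the printed estimate should lie close to 0.025, noticeably below the Bonferroni bound of three times the single-test tail probability because the nested statistics are positively correlated.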

A.4 Design 4
The two-stage design with three subgroups uses equal prevalence for each subgroup and an interim analysis after half the patients have been observed. The critical value at the first stage is c 1 = 3.119, whereas the final critical value is c 2 = 2.205. A fixed futility bound of zero is used and the total sample size is N = 335.