Point estimation following two‐stage adaptive threshold enrichment clinical trials

Recently, several study designs incorporating treatment effect assessment in biomarker‐based subpopulations have been proposed. Most statistical methodologies for such designs focus on the control of type I error rate and power. In this paper, we have developed point estimators for clinical trials that use the two‐stage adaptive enrichment threshold design. The design consists of two stages, where in stage 1, patients are recruited in the full population. Stage 1 outcome data are then used to perform interim analysis to decide whether the trial continues to stage 2 with the full population or a subpopulation. The subpopulation is defined based on one of the candidate threshold values of a numerical predictive biomarker. To estimate treatment effect in the selected subpopulation, we have derived unbiased estimators, shrinkage estimators, and estimators that estimate bias and subtract it from the naive estimate. We have recommended one of the unbiased estimators. However, since none of the estimators dominated in all simulation scenarios based on both bias and mean squared error, an alternative strategy would be to use a hybrid estimator where the estimator used depends on the subpopulation selected. This would require a simulation study of plausible scenarios before the trial.

[…] the size of tumor, protein level in the blood, and graded scores. When the clinical utility of the biomarker is not very strong or clear from previous studies, the biomarker stratified design may be used to test the effect of an experimental treatment. In this design, a trial enrolls patients from the full population but with provision for analyses of outcomes from the subpopulation.
One methodological challenge in stratified medicine is how to design and analyze efficient clinical trials that incorporate identification of the subpopulation that will benefit from the experimental treatment. An efficient design in late phase clinical trials is the two-stage adaptive enrichment design. 2 In stage 1, patients are recruited from the full population and data are used to perform an interim analysis to decide whether, in stage 2, enrollment will be from the full population or the subpopulation. The final confirmatory analysis uses data from both stages. Although the design is efficient because stage 1 data are used for subpopulation selection and confirmatory analysis, the latter is complex because of inclusion of subpopulation selection data.
We consider the case of a continuous (or a graded score) biomarker where the cut-off value to distinguish between biomarker positive and negative patients is not definite from previous trials. Consequently, several candidate cut-off values are possible, with trial data used to determine the cut-off value. Simon and Simon 2 refer to such a design that includes threshold determination as an adaptive threshold enrichment design. We give examples of clinical trials where this design can be used in Section 2.1.
Subpopulation selection based on the treatment effect can be advantageous because using an appropriate rule, the subgroup is selected in the case where there is apparent benefit in the subgroup and not in its complement (qualitative interaction) such as was observed by Mok et al. 3 The full population is selected if there is apparent benefit in the full population including when the drug benefits the subgroup and its complement with different magnitudes (quantitative interaction) such as was observed in Tran et al. 4 A subpopulation selection based on a hypothesis test for interaction only would not be able to distinguish between the two types of interactions.
Previous research that considers analysis of adaptive threshold enrichment trials focuses on control of type I error rate and power, with less emphasis on point estimation. 2,5 Recently, Li et al 6 have derived expressions for the biases of estimators that ignore the adaptation but do not propose point estimators that account for subpopulation selection. Kimani et al 7 and Kunzmann et al 8 have developed estimators for a setting analogous to a single fixed cut-off value. However, these estimators do not allow for using stage 1 data to determine the cut-off value in an adaptive threshold enrichment trial.
A setting similar to an adaptive threshold enrichment design is that of treatment selection, where a control is compared to multiple experimental treatments, with stage 1 data used to select the experimental treatment to test further in stage 2. [9][10][11][12][13][14][15][16] Although several point estimators for this setting exist, they cannot be applied directly in adaptive threshold enrichment clinical trials because the correlation structure of the stage 1 sample means used for selection is different.
In this paper, we develop estimators that account for subpopulation selection following adaptive threshold enrichment trials using the principles that have been used to obtain point estimators that account for treatment selection. Two unbiased estimators build on the works by Kimani et al 7 and Robertson et al. 17 Two estimators build on the works by Whitehead 18 and Stallard and Todd 10 and involve deriving the bias function to calculate bias and subtracting bias from the naive estimator. The last is a shrinkage estimator and builds on the works by Hwang 19 and Carreras and Brannath. 14

Motivation and notation
A condition where continuous biomarkers are tested and so the adaptive threshold design may be used is depression. Examples of continuous predictive biomarkers in depression are protein levels in the blood and an electrophysiological measure. 20 While introducing notation, we describe features of clinical trials that are key in our methodology based on the setting of depression.
Patients' outcomes will be assumed to be normally distributed with a known standard deviation $\sigma$. In the context of depression, Uher et al 20 perform simulations to give guidance on the treatment effect size to be sought when predictive biomarkers are evaluated. One outcome measure they consider, which is widely used in trials, is the Hamilton Rating Scale for Depression (HRSD) score, which is usually assumed to be normally distributed. For a trial of a prespecified duration of treatment, the aim may be to estimate the mean difference (experimental arm minus control arm) in HRSD scores between two interventions at the final follow-up visit. Based on two trials, 21,22 the standard deviation of HRSD scores may be taken to be 7, that is, $\sigma = 7$.
We will consider trials that allow stopping for futility at an interim analysis if the observed treatment difference is less than some value b that we refer to as the futility boundary. The UK NICE guidelines recommend that an intervention for depression should demonstrate a difference of at least 3 HRSD points 20 to be considered superior to its comparator. Therefore, at an interim analysis, the treatment may be deemed not to warrant further testing if the observed mean difference < 2 (slightly less than the recommended value of 3), that is, b = 2.
We assume that a single continuous biomarker is used to identify the patients who benefit from a new intervention. We assume that in regard to biomarker values, there is monotonicity in treatment effect so that a higher biomarker value leads to a bigger treatment effect or a higher biomarker value leads to a smaller treatment effect. For ease of notation, we use the latter to develop methodology. Note that, if a higher biomarker value leads to a bigger treatment effect, the biomarker values can be transformed by multiplying by −1.
Using some biomarker threshold values, the full population ($F$) is partitioned into distinct partitions. For example, if $F$ is subdivided into four partitions, the candidate threshold values $c_1$, $c_2$, $c_3$, and $c_4$ are such that patients in partitions 1, 2, 3, and 4 have biomarker values less than $c_1$, between $c_1$ and $c_2$, between $c_2$ and $c_3$, and between $c_3$ and $c_4$, respectively. The true mean differences in partitions 1 to 4 are denoted by $\delta_1$, $\delta_2$, $\delta_3$, and $\delta_4$, respectively. We denote the number of partitions by $K$ so that, in this case, $K = 4$. We refer to the parts of $F$ below threshold values $c_1$, $c_2$, $c_3$, and $c_4$ as subpopulations $S_1$, $S_2$, $S_3$, and $S_4$. Note that for $K = 4$, $S_K = S_4 = F$, and $S_1$, $S_2$, $S_3$, and $S_4$ consist of partition 1, partitions 1 and 2, partitions 1 to 3, and partitions 1 to 4, respectively. The true mean differences in $S_1$, $S_2$, $S_3$, and $S_4$ are denoted by $\theta_1$, $\theta_2$, $\theta_3$, and $\theta_4$, respectively. If, as expected, a higher biomarker value leads to a smaller treatment effect, then $\delta_1 \ge \delta_2 \ge \delta_3 \ge \delta_4$ and $\theta_1 \ge \theta_2 \ge \theta_3 \ge \theta_4$.
We assume that the threshold values $c_1, \ldots, c_K$ are prespecified. There are different ways to choose the threshold values. For $K = 4$, quartiles may be used so that the prevalences for $S_1$ to $S_4$ are $p_1 = 0.25$, $p_2 = 0.50$, $p_3 = 0.75$, and $p_4 = 1$, respectively. Consequently, the partitions have equal prevalence (0.25) since, if we set $p_0 = 0$, $p_i - p_{i-1} = 0.25$ ($i = 1, \ldots, 4$). In some instances, the threshold values are chosen based on aspects such as biological activity so that the prevalences for partitions are not equal. Figure 1 summarizes the partitioning of $F$ for any $K \ge 3$.
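The link between partition effects and subpopulation effects can be illustrated with a short sketch. It assumes that the true mean difference in a subpopulation is the prevalence-weighted average of the mean differences in the partitions it contains; the function name, the symbols `delta` (partition effects) and `theta` (subpopulation effects), and the numerical effects are illustrative choices, not values from the paper.

```python
def subpopulation_effects(delta, prevalence):
    """Return subpopulation effects theta_1..theta_K from partition effects.

    delta[i]      : true mean difference in partition i+1 (hypothetical values)
    prevalence[i] : cumulative prevalence p_{i+1} of subpopulation S_{i+1},
                    with prevalence[-1] == 1 for the full population F
    ASSUMPTION: theta_i is the prevalence-weighted mean of delta_1..delta_i.
    """
    theta = []
    weighted_sum = 0.0
    previous_p = 0.0  # p_0 = 0
    for d, p in zip(delta, prevalence):
        weighted_sum += (p - previous_p) * d  # partition prevalence times effect
        theta.append(weighted_sum / p)
        previous_p = p
    return theta


# Quartile thresholds as in the text; the partition effects are made up.
theta = subpopulation_effects([4.0, 3.0, 2.0, 1.0], [0.25, 0.50, 0.75, 1.0])
print(theta)  # [4.0, 3.5, 3.0, 2.5]
```

With decreasing partition effects, the subpopulation effects are also decreasing, which is the ordering the selection rule relies on.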

Hypothetical two-stage adaptive threshold enrichment clinical trial
Predictive assessment of continuous biomarkers can be done in single-stage clinical trials. 23,24 The alternative is to use the two-stage adaptive threshold enrichment design, which is more efficient as more resources can be focused on the subpopulation that is most likely to benefit from the new treatment. 24 The design has been used in recent trials with time-to-event (progression-free survival) outcome data. 5,25,26 As we propose in this paper, the design can be similarly used in trials with normally distributed outcome data. We note in Section 6 that the methods developed in this paper can be adapted for time-to-event outcome data.
We describe the form of the adaptive threshold enrichment design that we consider based on a hypothetical trial for depression where, for example, protein level is used to partition $F$ into quartiles. In stage 1, the trial recruits $n_{11} = 90$, $n_{12} = 90$, $n_{13} = 90$, and $n_{1K} = n_{14} = 90$ patients in partitions 1 to 4. The numbers of patients in $S_1$ to $S_4$ are $m_{11} = 90$, $m_{12} = 180$, $m_{13} = 270$, and $m_{1K} = m_{14} = 360$, respectively, since $m_{1i} = \sum_{i'=1}^{i} n_{1i'}$ ($i = 1, \ldots, 4$). For simplicity, we assume that, in each partition, the 90 patients are equally split between the control and the experimental treatment. The outcome of interest is HRSD score and is assumed to be normally distributed with $\sigma = 7$. Let $\sigma_{11}^2 = 4\sigma^2/n_{11}$, $\sigma_{12}^2 = 4\sigma^2/n_{12}$, […]
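The stage 1 bookkeeping for the hypothetical trial can be sketched as follows. The cumulative sums reproduce the subpopulation sample sizes, and the variance of a partition's sample mean difference is taken as 4σ²/n, which follows from splitting the n patients equally between two arms. Applying the same form to subpopulations is an assumption made here for illustration; the variable names are hypothetical.

```python
from itertools import accumulate

sigma = 7.0                # known SD of HRSD scores, as in the text
n1 = [90, 90, 90, 90]      # stage 1 patients per partition (both arms combined)

# Patients in S_1..S_4 are cumulative sums of the partition sample sizes.
m1 = list(accumulate(n1))

# Variance of the sample mean difference with n patients split equally
# between two arms: sigma^2/(n/2) + sigma^2/(n/2) = 4*sigma^2/n.
var_partition = [4 * sigma**2 / n for n in n1]
# ASSUMPTION: the same expression with m in place of n for subpopulations.
var_subpopulation = [4 * sigma**2 / m for m in m1]

print(m1)  # [90, 180, 270, 360]
```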
Since a higher biomarker value is expected to lead to a lower treatment effect, the largest subpopulation for which the observed stage 1 sample mean difference (in HRSD scores) is $\ge b$ is selected to continue to stage 2. If the observed stage 1 sample mean differences in $S_1$, $S_2$, $S_3$, and $S_4 = F$ are all less than $b$, the trial stops for futility. Note that the selected subpopulation is a random variable determined by observed stage 1 data. We use lower case $s$ ($s \in \{1, \ldots, 4\}$) as the index for the "observed" selected subpopulation, with $S_s$ ($s \in \{1, \ldots, 4\}$) denoting the selected subpopulation. At the end of stage 2, the primary objective is to obtain an estimate for $\theta_s$, using an estimator that has good properties such as being mean unbiased and having small mean squared error (MSE).
In stage 2, the trial recruits $n_{21} = 120$ and $n_{22} = 120$ patients in partitions 1 and 2, respectively. The number of patients in […] Table 1 summarizes the notation we have introduced for any $K \ge 3$. When a subscript in a notation includes two indices, the first corresponds to stage and the second to partition or subpopulation.
Suppose that, in stage 2, the observed sample mean differences in partitions 1 and 2 are $\bar{x}_{21} = 3.0$ and $\bar{x}_{22} = 2.4$. Consequently, the stage 2 observed sample mean difference for $S_2$ is $\bar{y}_{22} = 2.7$. The naive estimate for $\theta_2$ is the two-stage sample mean difference for $S_2$ given by $\hat{\theta}_{2,N} = (m_{12}\bar{y}_{12} + m_{22}\bar{y}_{22})/(m_{12} + m_{22}) = 2.614$. We describe in Section 2.4 that the naive estimates are biased because they ignore subpopulation selection. The aim of this paper is to develop estimators that adjust for subpopulation selection. The estimators are based on the selection rule described for the hypothetical trial, which we state for any $K \ge 3$ partitions in the next section, and are conditional on the observed ordering of stage 1 data.
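The arithmetic behind the naive estimate can be checked with a few lines. The stage 1 sample mean difference for the selected subpopulation does not appear in the surviving text, so the value 2.5 used below is an assumption, back-solved from the reported estimate of 2.614. The stage 2 sample size of 240 for the selected subpopulation follows from 120 patients in each of partitions 1 and 2.

```python
m_12, m_22 = 180, 240      # stage 1 and stage 2 patients in S_2
n_21, n_22 = 120, 120      # stage 2 patients in partitions 1 and 2
x_21, x_22 = 3.0, 2.4      # stage 2 partition mean differences (from the text)

# Stage 2 mean difference for S_2: sample-size-weighted mean of partitions.
y_22 = (n_21 * x_21 + n_22 * x_22) / (n_21 + n_22)

# ASSUMPTION: stage 1 mean difference for S_2, back-solved from 2.614.
y_12 = 2.5

t_2 = m_12 / (m_12 + m_22)                 # proportion of stage 1 data
naive = t_2 * y_12 + (1 - t_2) * y_22      # two-stage sample mean difference

print(round(y_22, 3), round(naive, 3))  # 2.7 2.614
```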

Selection rule
We derive estimators that are unbiased or with small bias conditional on the following specific selection rule. Other selection rules are considered in the discussion. Let $b$ denote a futility boundary. The trial stops after stage 1 if $\bar{y}_{1i} < b$ for all $i$ ($i = 1, \ldots, K$). The trial continues to stage 2 with the full population ($S_K$) if $\bar{y}_{1K} \ge b$ and with subpopulation $S_s$ ($s < K$) if $\bar{y}_{1s} \ge b$ and $\bar{y}_{1i} < b$ for all $i = s + 1, \ldots, K$.
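A minimal sketch of this selection rule, as it is described for the hypothetical trial (the largest subpopulation whose stage 1 sample mean difference is at least the futility boundary continues), is given below. The function name and the example mean differences are hypothetical.

```python
def select_subpopulation(ybar1, b):
    """Return the 1-based index s of the selected subpopulation S_s.

    ybar1[i-1] is the observed stage 1 sample mean difference in S_i, with
    S_K the full population. Returns None if the trial stops for futility.
    """
    # Check the full population first, then successively smaller subpopulations.
    for s in range(len(ybar1), 0, -1):
        if ybar1[s - 1] >= b:
            return s
    return None


print(select_subpopulation([3.1, 2.5, 1.8, 1.2], b=2))  # 2
print(select_subpopulation([1.0, 0.5, 0.2, 0.1], b=2))  # None
```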

Naive estimation
For the selected subpopulation $S_s$ ($s \in \{1, \ldots, K\}$), define $t_s = m_{1s}/(m_{1s} + m_{2s})$. The naive estimator for $\theta_s$ that ignores subpopulation selection is

$$\hat{\theta}_{s,N} = t_s\bar{Y}_{1s} + (1 - t_s)\bar{Y}_{2s}. \qquad (1)$$

This is biased because the first term in (1) includes data used in the selection. Let $1[S_s]$ and $\mathrm{Prob}(S_s)$ denote the indicator and probability of selecting $S_s$, respectively. The conditional bias is

$$t_s\left\{\frac{E\left(\bar{Y}_{1s}\,1[S_s]\right)}{\mathrm{Prob}(S_s)} - \theta_s\right\}. \qquad (2)$$

Using the joint density for $\bar{X}_1$ or $\bar{Y}_1$ to compute $\mathrm{Prob}(S_s)$ and $E(\bar{Y}_{1s}\,1[S_s])$ is computationally time consuming because the limits of integration for each element in the vector depend on the values of the other elements. To overcome this, we use a transformed vector $Z$. The density for $Z$ and the expressions for $\mathrm{Prob}(S_s)$ and $E(\bar{Y}_{1s}\,1[S_s])$ are provided in the supplementary material.
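The selection bias of the naive estimator can also be seen by simulation rather than by integration. The sketch below is a Monte Carlo illustration, not the analytic approach of the paper. It assumes equal partition prevalences, a stage 2 sample that is spread evenly over the selected partitions (so the stage 2 mean difference can be drawn directly from its normal distribution), and made-up settings of no treatment effect in any partition with a futility boundary of 0. All names are hypothetical.

```python
import math
import random


def simulate_naive_bias(delta, n1, total_n, sigma=1.0, b=0.0,
                        n_trials=20000, seed=1):
    """Monte Carlo conditional bias of the naive estimator, by selected S_s."""
    rng = random.Random(seed)
    K = len(delta)
    m1 = [sum(n1[:i + 1]) for i in range(K)]          # stage 1 size of S_i
    # True subpopulation effects: sample-size-weighted partition effects.
    theta = [sum(n1[j] * delta[j] for j in range(i + 1)) / m1[i]
             for i in range(K)]
    m2 = total_n - m1[-1]                              # stage 2 patients in S_s
    errors = {s: [] for s in range(1, K + 1)}
    for _ in range(n_trials):
        # Stage 1 partition mean differences, variance 4*sigma^2/n.
        x1 = [rng.gauss(delta[i], math.sqrt(4 * sigma**2 / n1[i]))
              for i in range(K)]
        # Stage 1 subpopulation mean differences (correlated across i).
        y1 = [sum(n1[j] * x1[j] for j in range(i + 1)) / m1[i]
              for i in range(K)]
        # Largest subpopulation with mean difference >= b, else futility.
        s = next((i for i in range(K, 0, -1) if y1[i - 1] >= b), None)
        if s is None:
            continue
        y2 = rng.gauss(theta[s - 1], math.sqrt(4 * sigma**2 / m2))
        t = m1[s - 1] / (m1[s - 1] + m2)
        naive = t * y1[s - 1] + (1 - t) * y2
        errors[s].append(naive - theta[s - 1])
    return {s: (sum(e) / len(e) if e else float("nan"))
            for s, e in errors.items()}


bias = simulate_naive_bias(delta=[0.0] * 4, n1=[50] * 4, total_n=800)
print({s: round(v, 3) for s, v in bias.items()})
```

In this setting the full population is selected only when its stage 1 mean difference is non-negative, so the simulated bias for that case is positive and scales with the proportion of stage 1 data.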

Unbiased estimators

General principles of obtaining unbiased estimators
One technique to account for subpopulation selection is Rao-Blackwellization. By the Rao-Blackwell theorem, conditional on a sufficient and complete statistic based on stages 1 and 2 data, the expected value of a conditionally unbiased estimator from the stage 2 data is the uniformly minimum variance conditional unbiased estimator (UMVCUE). We consider two methods for obtaining unbiased estimators for $\theta_s$: deriving a UMVCUE for $\theta_s$ directly or, because the relationship between $\delta$ and $\theta$ is linear, deriving the UMVCUE for each $\delta_i$ ($i = 1, \ldots, s$) and using a linear function to obtain an unbiased (though not necessarily minimum variance) estimator for $\theta_s$. The latter builds on the work by Kimani et al. 7 The former would involve correlated stage 1 statistics in the vector $\bar{Y}_1$ and builds on the work by Robertson et al. 17

Uniformly minimum variance unbiased estimator following the work of Robertson et al (2016a)
The UMVCUE for $\theta_s$ is the expected value of $\bar{Y}_{2s}$ conditional on a sufficient and complete statistic. As before, let $\hat{\theta}_{s,N}$ denote the naive estimator for $\theta_s$ given by expression (1) and $U$ be as $u$ in Section 2.3 with $\bar{x}_{1,s+1}, \ldots, \bar{x}_{1K}$ replaced with $\bar{X}_{1,s+1}, \ldots, \bar{X}_{1K}$. Following the work of Robertson et al, 17 the UMVCUE for $\theta_s$ is […], and $\phi(\cdot)$ and $\Phi(\cdot)$ denote the density and distribution functions of a standard normal, respectively.

Unbiased estimator following the work of Kimani et al (2015)
The UMVCUE for $\delta_{i'}$ ($i' = 1, \ldots, s$) is the expected value of $\bar{X}_{2i'}$ conditional on a sufficient and complete statistic. Let […]. Consequently, the unbiased estimator for $\theta_s$ is $\hat{\theta}_{s,U}$ […]

An overview of bias-adjusted estimation
Another technique to account for subpopulation selection would be to utilize the fact that we can calculate bias of the naive estimate using expression (2). The naive estimate is then adjusted by subtracting the bias. However, expression (2) is a function of $\theta$ (or equivalently $\delta$), the vector of the unknown treatment effects. To overcome this, we estimate bias, and hence, bias-adjusted estimators obtained in this way are not necessarily mean unbiased.

Single-iteration bias-adjusted estimator
We consider two bias-adjusted estimators. For the first one, the bias is estimated based on the observed sample means. The bias estimate for $\theta_s$ is obtained by replacing the unknown treatment effects with their observed estimates in expression (2), and subtracting this estimate from the naive estimator $\hat{\theta}_{s,N}$ gives the adjusted estimator $\hat{\theta}_{s,SI}$. We will refer to this estimator as the single-iteration bias-adjusted estimator.

Multiple-iteration bias-adjusted estimator
For the second bias-adjusted estimator, the bias is estimated iteratively. 10,13,18 Let̂i (i = 1, … , K) denote the naive estimator for i and̂= (̂1, … ,̂K) ′ . The biases for the naive estimators depend on and we denote bias for̂i (i = 1, … , K) by b i ( ) and the vector (b 1 ( ), … , b K ( )) by b( ). The second adjusted estimator, which we refer to as multiple-iteration bias-adjusted estimator is obtained by solving̃=̂− b(̃) iteratively. Using similar notation, alternatively, one could solvẽ=̂− b(̃) and then use the relationship between and to obtain a bias-adjusted estimate for s . For the simulations in Section 5, we solvẽ=̂− b(̃) and with an accuracy of 0.001, convergence was achieved in almost all simulated trials. Suppose that the solution is obtained at iteration r and let b i (̃r) denote the bias for̂i when is taken to bẽr, then the multiple-iteration adjusted estimate for i iŝi ,MI =̂i − b i (̃r) and the multiple-iteration bias-adjusted estimator for s iŝs The details of calculating b i (̃r) are given in the supplementary materials.
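The iterative idea can be sketched in a simplified setting. The example below is a single-population analogue and not the multivariate bias function of the paper: it assumes one population that continues to stage 2 only if its stage 1 mean difference is at least the futility boundary, so the bias of the naive estimator is the proportion of stage 1 data times the excess of a truncated normal mean. The fixed-point iteration and the stopping accuracy of 0.001 mirror the description above; the function names and input values are hypothetical.

```python
import math


def futility_bias(theta, t, sd1, b):
    """Bias of the naive estimator when continuing only if stage 1 mean >= b.

    ASSUMPTION: single population; stage 1 mean ~ N(theta, sd1^2), so
    E[mean | mean >= b] - theta = sd1 * pdf(z) / P(Z >= z), z = (b - theta)/sd1.
    """
    z = (b - theta) / sd1
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    upper_tail = 0.5 * math.erfc(z / math.sqrt(2))   # P(Z >= z)
    return t * sd1 * pdf / upper_tail


def multiple_iteration_estimate(theta_hat, bias_fn, tol=0.001, max_iter=100):
    """Solve theta_tilde = theta_hat - bias_fn(theta_tilde) by iteration."""
    theta_tilde = theta_hat                 # start from the naive estimate
    for _ in range(max_iter):
        updated = theta_hat - bias_fn(theta_tilde)
        if abs(updated - theta_tilde) < tol:
            return updated
        theta_tilde = updated
    raise RuntimeError("fixed-point iteration did not converge")


# Hypothetical inputs: naive estimate 2.6, half the data from stage 1,
# stage 1 standard error sqrt(4 * 7^2 / 360), futility boundary 2.
sd1 = math.sqrt(4 * 7.0**2 / 360)
estimate = multiple_iteration_estimate(
    2.6, lambda th: futility_bias(th, t=0.5, sd1=sd1, b=2.0))
print(round(estimate, 3))
```

The first pass through the loop corresponds to plugging the naive estimate into the bias function; later passes re-evaluate the bias at the adjusted value until successive values differ by less than the tolerance.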

Shrinkage estimators

General principles for shrinkage estimation
A third technique for accounting for subpopulation selection is to use shrinkage methods. Hwang 19 considered the case of estimating a treatment mean after ordering independent sample means in a single-stage trial for $K \ge 4$. A subpopulation selection rule that corresponds to Hwang's case is that of selecting only one partition based on some ordering of $\bar{x}_{11}, \ldots, \bar{x}_{1K}$. We initially consider Hwang's selection rule and denote the selected partition by $s_H$ ($s_H \in \{1, \ldots, K\}$).
Hwang assigns a common normal prior distribution $N(\mu, \tau^2)$ to each $\delta_i$ ($i = 1, \ldots, K$). The posterior mean for $\delta_{s_H}$, its Bayes estimator, is $C\bar{X}_{1s_H} + (1 - C)\mu$, where $C = \tau^2/(\tau^2 + 2\sigma^2/n)$ and $n$ is the stage 1 sample size in each intervention in each partition. Replacing the unknown $\mu$ and $C$ with their unbiased estimators $\bar{Y}_{1K} = \sum_{i=1}^{K}\bar{X}_{1i}/K$ and $\hat{C}$ […], respectively, gives the empirical Bayes estimator. Let $\hat{C}^{+} = \max\{0, \hat{C}\}$. Hwang indicates that a better estimator, which we refer to as the shrinkage estimator, is $\hat{C}^{+}\bar{X}_{1s_H} + (1 - \hat{C}^{+})\bar{Y}_{1K}$. Carreras and Brannath 14 extended the work to two-stage trials. Define $t_{s_H} = n_{1s_H}/(n_{1s_H} + n_{2s_H})$ to be the proportion of stage 1 data. The two-stage shrinkage estimator for $\delta_{s_H}$ is $\hat{\delta}_{s_H,B} = t_{s_H}\{\hat{C}^{+}\bar{X}_{1s_H} + (1 - \hat{C}^{+})\bar{Y}_{1K}\} + (1 - t_{s_H})\bar{X}_{2s_H}$. Using the fact that the estimator of Hwang 19 applies for all parameters $\delta_i$ ($i = 1, \ldots, K$) and that its examination by Carreras and Brannath showed that it works for any rule used to pick the parameters on which to make inference, in Sections 3.3.2 and 3.3.3, we extend this work to give two shrinkage estimators for the subpopulation selection rule in Section 2.3.
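The shrinkage step can be sketched as follows. The expression for the estimated shrinkage factor is not recoverable from the text, so the sketch assumes a Lindley-type form, 1 − (K − 3)(2σ²/n)/Σ(x̄ − mean)², which is the usual choice for a common normal prior with unknown mean and requires K ≥ 4. The function names and the stage 1 mean differences are hypothetical.

```python
def hwang_shrinkage(x1, sigma, n):
    """Shrink stage 1 partition mean differences toward their overall mean.

    x1    : stage 1 sample mean differences in the K partitions (K >= 4)
    sigma : known outcome standard deviation
    n     : stage 1 sample size per arm per partition, so each mean
            difference has variance 2*sigma^2/n
    ASSUMPTION: Lindley-type estimate of the shrinkage factor C.
    """
    K = len(x1)
    grand_mean = sum(x1) / K
    ss = sum((x - grand_mean) ** 2 for x in x1)
    if ss == 0:
        return [grand_mean] * K                 # no spread: shrink fully
    c_hat = 1 - (K - 3) * (2 * sigma**2 / n) / ss
    c_plus = max(0.0, c_hat)                    # positive-part version
    return [c_plus * x + (1 - c_plus) * grand_mean for x in x1]


def two_stage_shrinkage(shrunk_stage1, x2, n1, n2):
    """Combine the shrunk stage 1 estimate with the stage 2 sample mean."""
    t = n1 / (n1 + n2)                          # proportion of stage 1 data
    return t * shrunk_stage1 + (1 - t) * x2


x1 = [3.5, 2.8, 2.0, 1.1]                       # made-up stage 1 values
shrunk = hwang_shrinkage(x1, sigma=7.0, n=45)
print([round(v, 3) for v in shrunk])
print(round(two_stage_shrinkage(shrunk[0], x2=3.0, n1=90, n2=120), 3))
```

Each shrunk value lies between the partition's own sample mean difference and the overall mean, so the largest observed effect is pulled down and the smallest is pulled up.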

First shrinkage estimator
As in unbiased estimation, we consider both combining shrinkage estimators for treatment effects in partitions to obtain an estimator for $\theta_s$ and directly obtaining a shrinkage estimator for $\theta_s$. From Section 3.3.1, the shrinkage estimator for […] The first shrinkage estimator for $\theta_s$ is $\hat{\theta}_{s,L_1}$ […]

Second shrinkage estimator
The second shrinkage estimator, which we denote by $\hat{\theta}_{s,L_2}$, involves using the entire parameter vector $\theta$. A multivariate normal prior for $\theta$ is specified and updated with the data $\bar{Y}_1$. The resulting posterior is multivariate normal with nonzero covariance, and hence, the iterative procedure of Morris 27 and Brüncker et al 28 is utilized to obtain $\hat{\theta}_{s,L_2}$ (see supplementary material).

WORKED EXAMPLE
We use data from the hypothetical trial for depression in Section 2.2 to demonstrate how to compute the naive ($\hat{\theta}_{2,N}$), the UMVCUE ($\hat{\theta}_{2,UMV}$), the unbiased ($\hat{\theta}_{2,U}$), the single-iteration bias-adjusted ($\hat{\theta}_{2,SI}$), the multiple-iteration bias-adjusted ($\hat{\theta}_{2,MI}$), the first shrinkage ($\hat{\theta}_{2,L_1}$), and the second shrinkage ($\hat{\theta}_{2,L_2}$) estimates. We also use the example to demonstrate differences among the various estimates in a single trial. The data and the various estimates are summarized in Table 2. The explicit computations for the various estimates and the R program used are provided in the supplementary material. Here, we only give explicit details of computing $\hat{\theta}_{2,UMV}$ and $\hat{\theta}_{2,U}$ as they are easier to compute and since, based on the simulations in the next section, we recommend $\hat{\theta}_{2,UMV}$. For the UMVCUE ($\hat{\theta}_{2,UMV}$) given by expression (3) […]. Similarly, for partition 2, $w_2$ = 2.[…] (Table 2). This may be explained by the observation in Section 5.2 that, in some scenarios, the naive estimator is negatively biased. The estimate $\hat{\theta}_{2,SI}$ is slightly smaller than $\hat{\theta}_{2,MI}$. Again, this may be explained by an observation in Section 5.2 that, for all scenarios in the simulation study, on average, the single-iteration bias-adjusted estimator gives a smaller estimate than the multiple-iteration estimator.

Simulations setting
To evaluate the properties of the various estimators, we conducted simulations with $\sigma^2 = 1$ and $b = 0$. We initially consider the case of $K = 4$ and $p_i - p_{i-1} = 0.25$ ($i = 1, \ldots, 4$). In all simulations, if the trial continues to stage 2, the combined stages 1 and 2 sample size is set to be 800. For example, if the stage 1 sample size is 400 patients, the stage 2 sample size is 400. The available patients in stage 1 are equally split among the four partitions and treatment arms. For example, with 400 patients in stage 1, in each partition, 50 patients are randomly allocated to each of the control and experimental treatment. Similarly, the patients available for testing in stage 2 are equally split among the partitions that continue to stage 2 and among the treatment arms. Hence, with 400 patients available in stage 2, if $F$ is selected, the patient allocation in stage 2 is as in stage 1 with 400 patients. If $S_2$ is selected so that two partitions are tested in stage 2, in each partition, 100 patients are randomly allocated to each of the control and experimental treatment. We perform simulations for three cases of stage 1 sample size (200, 400, and 600 patients). Taking the combined stages 1 and 2 to be 800 patients is justified in the supplementary material.
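The allocation described above amounts to simple division, which the following sketch makes explicit. The function name and its arguments are hypothetical; it assumes sample sizes that divide evenly.

```python
def per_arm_allocation(total_n, stage1_n, partitions_selected, K=4):
    """Patients per arm per partition in stage 1 and in stage 2.

    Stage 1 patients are split equally over K partitions and two arms;
    stage 2 patients are split equally over the selected partitions and
    two arms.
    """
    stage1_per_arm = stage1_n // (2 * K)
    stage2_n = total_n - stage1_n
    stage2_per_arm = stage2_n // (2 * partitions_selected)
    return stage1_per_arm, stage2_per_arm


print(per_arm_allocation(800, 400, partitions_selected=4))  # (50, 50)
print(per_arm_allocation(800, 400, partitions_selected=2))  # (50, 100)
```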
We consider seven scenarios with true treatment effects as summarized in Table 3. The selection rule and estimators developed are aimed at identifying predictive effects, but since we are estimating mean differences, the methods are valid with or without prognostic effects. If the biomarker has no predictive effect but has a prognostic effect, we are in a scenario of equal treatment effects in all partitions. Scenarios 1, 3, and 7 could be such cases. If there are prognostic and predictive effects, we are in a scenario of unequal treatment effects in partitions. Scenarios 2, 4, 5, and 6 could be such cases. In Scenarios 1 to 3, the right decision is to continue to stage 2 with F, but with decreasing probability of selecting F. The right decisions for Scenarios 4 to 6 are to continue with S 3 , S 2 , and S 1 , respectively. The ideal decision for Scenario 7 is to stop at stage 1. The probabilities for various decisions for different scenarios when stage 1 includes 200 patients (25 in each treatment arm in each partition) are also given in Table 3. These have been calculated using expressions in Section 2.4 and in the supplementary material. As expected, the probability of stopping the trial at stage 1 (last column) increases as the treatment effects in partitions become less than b in more partitions (from 0.007 for Scenario 1 to 0.482 for Scenario 7). In each of Scenarios 4 to 6, the probability of continuing with F is substantially larger than the probability of making the right decision, demonstrating that, in some configurations, decision making is challenging. In Section 5.2.1, simulations show that incorrect decisions tend to be made when observed means are substantially different from the true means and hence lead to bias. Table 4 gives probabilities of various decisions when the stage 1 sample sizes are 400 and 600. 
For Scenario 3, where treatment effects are equal in all partitions and equal to the futility boundary, the probabilities of various decisions are approximately equal for different stage 1 sample sizes. For the other scenarios, by comparing the probabilities in bold, the probability of making a correct decision increases with stage 1 sample size.
For each of the seven scenarios and three different stage 1 sample sizes, we simulated stage 1 data for N = 1 000 000 trials. For each trial, the largest subpopulation whose simulated stage 1 sample mean difference is ≥ 0 continues to stage 2. If no subpopulation fulfills this, the trial stops. We consider estimation conditional on continuing to stage 2 and so bias and MSE for each estimator are evaluated based on simulated trials that continue to stage 2. Using $\hat{\theta}_{s,SI}$ for illustration, for each $s$ ($s = 1, \ldots, 4$), bias and MSE are calculated as the averages of $\hat{\theta}_{s,SI} - \theta_s$ and $(\hat{\theta}_{s,SI} - \theta_s)^2$, respectively, over the simulated trials in which $S_s$ is selected.
In Figure 2, Columns 1 to 4 correspond to the cases of selecting $F$, $S_3$, $S_2$, and $S_1$, respectively. The y-axes correspond to biases divided by approximate standard errors (SEs). The approximate $\mathrm{SE} = \sqrt{4/(m_{1s} + m_{2s})}$ and so SEs are only equal when $F$ is selected (Column 1). Although SEs are not equal, we will later observe from the boxplots of the estimates that the trend for bias is the same when bias is not divided by SE. The x-axes correspond to the seven scenarios. As per the legend, biases for different estimators are distinguished by different line types. Estimators $\hat{\theta}_{s,UMV}$ and $\hat{\theta}_{s,U}$ are not included in Figure 2 because they are mean unbiased. For Scenario 1, the probabilities for selecting $S_3$, $S_2$, and $S_1$ are low and so simulation results are highly variable when $S_3$, $S_2$, or $S_1$ is selected, but this does not change the general findings in this paper.
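The summary measures plotted in the figures can be computed from simulated estimates as sketched below. The sketch assumes that the estimates passed in come only from trials in which the subpopulation of interest was selected, and it uses the approximate standard error stated in the text with unit variance. The function name and the toy inputs are hypothetical.

```python
import math


def standardized_bias_and_rmse(estimates, theta_s, m1s, m2s, sigma2=1.0):
    """Bias/SE and RMSE/SE for one estimator, conditional on selecting S_s.

    estimates : estimates from the simulated trials that selected S_s
    theta_s   : true treatment effect in S_s
    m1s, m2s  : stage 1 and stage 2 sample sizes in S_s
    """
    n = len(estimates)
    bias = sum(e - theta_s for e in estimates) / n
    mse = sum((e - theta_s) ** 2 for e in estimates) / n
    se = math.sqrt(4 * sigma2 / (m1s + m2s))   # approximate standard error
    return bias / se, math.sqrt(mse) / se


# Toy input: two estimates scattered symmetrically around the true effect.
std_bias, std_rmse = standardized_bias_and_rmse(
    [0.1, 0.3], theta_s=0.2, m1s=200, m2s=600)
print(round(std_bias, 3), round(std_rmse, 3))
```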

Comparing biases for the various estimators
We first describe the results for the case where the stage 1 sample size is 200 (top row). When $F$ is selected, the naive estimator ($\hat{\theta}_{s,N}$) and the first shrinkage estimator ($\hat{\theta}_{s,L_1}$) are the same and correspond to the line showing the largest biases. Focusing on the naive estimator, the bias when $F$ is selected (Column 1) is positive in all scenarios. For scenarios where the right decision is to continue with $F$ (Scenarios 1 to 3, see Table 3), bias when $F$ is selected is attributable to the futility rule, with the bias negligible when the effect in $F$ is substantially larger than the futility boundary (Scenario 1). When the right decision is not to continue with $F$ (Scenarios 4 to 7) but $F$ is selected, the impact of selection and futility on bias increases and consequently gives a larger bias. Still focusing on the top row, when $S_3$ is selected (Column 2), the naive estimator for $\theta_3$ is negatively biased for some scenarios and positively biased for other scenarios. The explanation for this pattern is given in the supplementary material. Comparing the bias when $F$, $S_3$, $S_2$, and $S_1$ are selected (Columns 1 to 4), the bias is smallest when $S_1$ is selected. This can be attributed partly to the enrichment, where the stage 2 sample size is fixed regardless of the size of the population selected so that when $S_1$ is selected, proportionally, there are more unbiased stage 2 data to estimate $\theta_1$ compared to when $F$, $S_3$, or $S_2$ is selected. In summary, in some scenarios, the bias of the naive estimator is substantial and so it is essential to use an estimator that corrects for subpopulation selection.
Still focusing on the top row, when $F$ is selected, the single-iteration bias-adjusted estimator $\hat{\theta}_{s,SI}$ is practically mean unbiased, especially for Scenarios 1 to 3 where the correct decision is to select $F$. When $S_3$ is selected, $\hat{\theta}_{s,SI}$ almost eradicates bias in Scenarios 3 to 7 and is better than the naive estimator in Scenarios 1 and 2. When $S_2$ or $S_1$ is selected, $\hat{\theta}_{s,SI}$ eradicates almost all bias in Scenarios 2 to 7 but does not do so in Scenario 1. In all scenarios, the line for the multiple-iteration bias-adjusted estimator ($\hat{\theta}_{s,MI}$) is always slightly above that of $\hat{\theta}_{s,SI}$. Hence, comparing $\hat{\theta}_{s,SI}$ and $\hat{\theta}_{s,MI}$, when $\hat{\theta}_{s,SI}$ is negatively biased, $\hat{\theta}_{s,MI}$ is preferable, whereas $\hat{\theta}_{s,SI}$ is preferable when it is positively biased.
Comparing biases for the naive estimator for different stage 1 sample sizes (top versus bottom plots), as also indicated by expression (2), the bias increases with the proportion of stage 1 data. Increase in bias is also seen for both the single-iteration ($\hat{\theta}_{s,SI}$) and multiple-iteration ($\hat{\theta}_{s,MI}$) bias-adjusted estimators. From the bottom row, $\hat{\theta}_{s,SI}$ and $\hat{\theta}_{s,MI}$ perform worst when some partitions that should be dropped at stage 1 continue to stage 2 or when some partitions that should continue to stage 2 are dropped. As before, the line for $\hat{\theta}_{s,MI}$ is above that of $\hat{\theta}_{s,SI}$, with the distances between the lines increasing with stage 1 sample size.
The pattern of the shrinkage estimators is best understood by considering all results in Figure 2. In all cases, the line for the first shrinkage estimator ($\hat{\theta}_{s,L_1}$) overlaps or is above that of the second shrinkage estimator ($\hat{\theta}_{s,L_2}$). Estimator $\hat{\theta}_{s,L_1}$ performs similar to or better than $\hat{\theta}_{s,L_2}$ when the selected subpopulation consists of partitions that should continue to stage 2, such as when $F$ is selected in Scenarios 1 to 3 and when $S_3$ is selected in Scenarios 1 to 4. Estimator $\hat{\theta}_{s,L_2}$ performs better than $\hat{\theta}_{s,L_1}$ when the selected subpopulation consists of partitions that should not continue to stage 2, such as when $F$ is selected in Scenarios 4 to 7 and when $S_3$ is selected in Scenarios 5 to 7. In almost all scenarios, the two shrinkage estimators perform worse than the other estimators that account for adaptation. One reason for this may be the fact that the shrinkage estimators do not account for stopping for futility. When $F$ is selected, the naive estimator is the same as the first shrinkage estimator. This is because the stage 1 estimate in partition $i$ is $\hat{C}^{+}\bar{X}_{1i} + (1 - \hat{C}^{+})\bar{Y}_{1K}$ so that the shrinkage estimator shrinks to the effect in the full population, that is, to $\bar{Y}_{1K} = (\bar{X}_{11} + \cdots + \bar{X}_{1K})/K$. A reasonable alternative would be to use a weighted mean of $\bar{Y}_{11}, \bar{Y}_{12}, \ldots, \bar{Y}_{1K}$. For example, if we shrink to $(\bar{Y}_{11} + \cdots + \bar{Y}_{1K})/K$, in terms of sample means in partitions, we are shrinking to a weighted sum such that for $i < i'$, $\bar{X}_{1i}$ has more weight than $\bar{X}_{1i'}$. In such a case, shrinkage estimators will be closer to the naive estimators when fewer partitions are selected (see additional simulations in the supplementary material).

Comparing MSEs for the various estimators
Mean squared errors for the various estimators are given in Figure 3. The y-axes are root mean squared errors ($\mathrm{RMSE} = \sqrt{\mathrm{MSE}}$) divided by approximate SEs. The best shrinkage estimator in terms of bias (either $\hat{\theta}_{s,L_1}$ or $\hat{\theta}_{s,L_2}$ depending on the scenario) has smaller or practically the same MSEs as the naive estimator. Hence, the best shrinkage estimators may be considered to be better than the naive estimator in terms of MSE. The challenge, however, is determining the best shrinkage estimator since the true treatment means are unknown.
Since estimators that extend the works of Kimani et al (̂s ,U ) and of Robertson et al (̂s ,UMV ) are mean unbiased, their MSEs are variances. When S 1 is selected, by derivation, the two estimators are the same and, hence, have equal MSE. For any other selection, as expected,̂s ,UMV has smaller MSE than̂s ,U . The differences increase with stage 1 sample size (top versus bottom plots) and the size of the selected subpopulation (right to left panels). The MSEs of̂s ,U and̂s ,UMV are mostly larger than the MSEs for all the other estimators with the differences substantial when selection is performed later in the trial.
In general, the MSEs for the single-iteration (̂s ,SI ) and multiple-iteration (̂s ,MI ) bias-adjusted estimators are practically the same. Hence, since their biases are also similar, the two estimators are approximately equivalent and so it is sufficient to compare one of them to the other estimators. The MSE for̂s ,SI is larger than that of the naive estimator (̂s ,N ) in most cases while it is always smaller than the MSEs for the unbiased estimators (̂s ,U and̂s ,UMV ).
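The quantities compared in this subsection can be computed from simulated estimates as in the following minimal sketch. The function name `summarize` and the example values are hypothetical. The approximate SE used for scaling is treated as a user-supplied input, since how it is computed is not specified in this section. The sketch also illustrates why the MSE of a mean unbiased estimator equals its variance: the MSE is the variance plus the squared bias.

```python
import math
import statistics


def summarize(estimates, true_effect, approx_se):
    """Bias, variance, MSE, and scaled RMSE of simulated estimates.

    approx_se is an assumed, user-supplied scaling constant.
    """
    bias = statistics.fmean(estimates) - true_effect
    # Population variance, so that mse == variance + bias**2 holds exactly.
    variance = statistics.pvariance(estimates)
    mse = statistics.fmean([(e - true_effect) ** 2 for e in estimates])
    return {
        "bias": bias,
        "variance": variance,
        "mse": mse,
        "scaled_rmse": math.sqrt(mse) / approx_se,
    }


# Illustrative values only; these are not results from the simulation study.
result = summarize([0.1, 0.3, 0.2, 0.4], true_effect=0.2, approx_se=0.1)
print(result)
```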

Comparing the estimators using both bias and MSE
Comparing the shrinkage estimators (̂s ,L 1 and̂s ,L 2 ) to the naive estimator (̂s ,N ), we prefer̂s ,N . This is because although a shrinkage estimator sometimes has a smaller MSE, it can have substantially higher bias than̂s ,N (for example, compare Columns 4 in Figures 2 and 3).
Comparing the single-iteration bias-adjusted estimator (̂s ,SI ) and the naive estimator (̂s ,N ), when F is selected,̂s ,SI is preferable as it reduces bias substantially and has smaller MSE. However, when S 1 is selected,̂s ,N is better as it has smaller MSE and it does not differ from̂s ,SI in terms of bias. When S 3 or S 2 is selected,̂s ,N is better when bias is not substantial (Scenarios 3 and 4), whereas for Scenarios 5 to 7,̂s ,SI is better as it reduces bias and its MSE is better or only slightly higher than that of̂s ,N . Overall, we consider̂s ,SI as a better estimator than̂s ,N as it performs better in cases with substantial bias.
When F is selected, the bias of the naive estimator (̂s ,N ) is substantial. Comparing it with the UMVCUE (̂s ,UMV ), we prefer the latter since the difference in RMSE between the two estimators is smaller than the bias eradicated. When S 1 is selected, we would also recommend̂s ,UMV over̂s ,N as the former is mean unbiased in all scenarios; the only case where it is not clearly superior, because of its high RMSE, is when n 1 = 600. The conclusion when S 3 or S 2 is selected is the same as when S 1 is selected, that is,̂s ,UMV is better than̂s ,N .
Comparing the single-iteration bias-adjusted estimator (̂s ,SI ) to the UMVCUE (̂s ,UMV ), we recommend the latter since, when F is selected,̂s ,SI has substantial bias that is larger than the difference in RMSE between it and̂s ,UMV . In addition, when S 1 is selected, the difference in RMSE between the two estimators is smaller than the bias of̂s ,SI . Consequently, based on the performance across the scenarios in the simulation study, we recommend̂s ,UMV when an adaptive threshold enrichment design is used.
For a more detailed comparison of the estimators, Figures 4 and 5 give boxplots of simulated estimates for Scenarios 1 (top plots), 4 (middle plots), and 6 (bottom plots) described in Table 3 when F and S 3 are selected. The boxplots emphasize the findings summarized above. As an example, when n 1 = 600 (Figure 5), for Scenario 6 (bottom left panel), almost all naive estimates are above the true value and̂s ,UMV performs well in that case. From the left panels, we note that the unbiased estimators (̂s ,UMV and̂s ,U ) have substantially higher variances than the other estimators.

Summary findings and recommendations from the simulation study
The bias of the naive estimator can be substantial, and so it is essential to use an estimator that corrects for the decision made using stage 1 data. We recommend the estimator that follows the work of Robertson et al (̂s ,UMV ) since it is mean unbiased. Although it has larger MSE than some estimators, the bias eradicated in most cases was larger than the difference in RMSEs. Although the simulation study was based on four partitions and specific treatment effect  We have recommended one estimator for all scenarios. An alternative is a hybrid estimator where the recommended estimator (̂s ,SI or̂s ,UMV ) depends on the subpopulation selected. This is suitable if investigators are willing to sacrifice unbiasedness for more precision. In this case, before the trial, a simulation study based on plausible scenarios would be required to compare bias and MSE conditional on the selected subpopulation.
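One way to set up such a hybrid estimator is sketched below. This is a minimal illustration under stated assumptions, not the authors' procedure. The decision rule mirrors the comparisons made in the simulation study: the unbiased estimator is preferred when the bias it removes exceeds the RMSE it adds. The function names, the dictionary layout, and all numbers are hypothetical.

```python
def choose_estimator(summary_si, summary_umv):
    """Choose between the single-iteration bias-adjusted (SI) estimator
    and the UMVCUE (UMV) for one selected subpopulation.

    Assumed rule: prefer UMV when the absolute bias of SI exceeds the
    extra RMSE incurred by using UMV instead of SI.
    """
    extra_rmse = summary_umv["rmse"] - summary_si["rmse"]
    return "UMV" if abs(summary_si["bias"]) > extra_rmse else "SI"


def build_hybrid_rule(pretrial_summaries):
    """Map each possible selected subpopulation to an estimator label."""
    return {
        subpop: choose_estimator(s["SI"], s["UMV"])
        for subpop, s in pretrial_summaries.items()
    }


# Hypothetical pre-trial simulation summaries, with bias and RMSE on the
# same scale. These numbers are illustrative and not from the paper.
pretrial = {
    "F": {"SI": {"bias": 0.30, "rmse": 1.00}, "UMV": {"bias": 0.0, "rmse": 1.10}},
    "S1": {"SI": {"bias": 0.02, "rmse": 1.00}, "UMV": {"bias": 0.0, "rmse": 1.40}},
}
hybrid_rule = build_hybrid_rule(pretrial)
print(hybrid_rule)
```

After the trial, the estimator reported would be the one mapped to the subpopulation actually selected. In practice, the summaries would come from simulations over several plausible scenarios, and investigators may weigh bias against precision differently from the simple rule assumed here.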

DISCUSSION
Acknowledging that different patients may require different care has led to trial designs that incorporate assessment of treatment effects in different subsets of the population. Most statistical methodologies for such designs focus on hypothesis testing. 2,5,24,26,29-33 In this paper, we have considered point estimation following an adaptive threshold enrichment clinical trial. We have assessed bias for the naive estimator when different subpopulations are selected. Depending on the scenario, the bias of the naive estimator of the treatment effect in the selected subpopulation is substantial and can be negative or positive. There is thus a need for new estimators. Building on estimators that have been proposed for treatment selection, we have derived several estimators that account for subpopulation selection. By derivation, two estimators are mean unbiased. In this paper, we have recommended the better of these two, that is, the UMVCUE. An alternative is a hybrid estimator where different estimators are recommended based on the selected subpopulation. This would require a simulation study before the trial and is suitable if investigators can accept some bias in exchange for a more precise estimator.
We have considered a specific selection rule but the proposed estimators can be modified for other selection rules. For example, it may be desired that different subpopulations have different futility boundaries. Futility boundaries may be based on factors such as subpopulation prevalence, and sponsor and public health gains. 34 Another factor is safety, where the futility boundary may be chosen to reflect investigators' willingness to accept moderate efficacy if the new treatment is substantially safer than the control. The selection rule we have used specifies that a higher biomarker value leads to a smaller treatment effect. If this is a misspecification of the relationship between the biomarker and treatment effect, the unbiased estimators will remain so because we condition on the selection rule. However, the probability of making the right decision will be low and we anticipate that the naive estimator will have more bias and that the unbiased estimators will have higher MSE.
In the derivations, we have not required the prevalences in different partitions to be equal. If the biomarker values are approximately continuous, then it is reasonable to subdivide the full population into equal partitions as we have done in the example and the simulations. Other numerical biomarker values may be discrete with few possible values, leading to partitions with varying sizes.
We have assumed the number of patients in each partition, and hence prevalence, is known. For the case of two partitions and a fixed cut-off value, taking the stage 1 number of patients in a partition to have a binomial distribution, Kimani et al 7 showed that using stage 1 prevalence estimates in the expressions for the unbiased estimators provides unbiased estimates for the treatment effects. This extends to the case of more than two partitions, where numbers of patients in partitions are taken to have a multinomial distribution. The proof is based on the fact that the estimator in a partition is unbiased conditional on the number of patients in an interval and that the proportion of patients in a partition is unbiased for the prevalence in the partition. The proof for the case of estimating the cut-off values using stage 1 data is similar.
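The second ingredient of this argument, that the stage 1 proportion of patients in a partition is unbiased for the prevalence, can be illustrated by simulation. The sketch below is a minimal illustration assuming multinomial allocation of patients to partitions. The function name, the stage 1 sample size, the number of replicates, and the prevalences are all hypothetical choices. It does not demonstrate the first ingredient, namely the conditional unbiasedness of the treatment effect estimators.

```python
import random


def mean_stage1_proportions(prevalences, n1, n_trials, seed=0):
    """Average, over simulated trials, of the observed stage 1
    proportion of patients falling in each partition."""
    rng = random.Random(seed)
    K = len(prevalences)
    totals = [0.0] * K
    for _ in range(n_trials):
        # Multinomial draw: each patient falls in partition i with
        # probability prevalences[i].
        draws = rng.choices(range(K), weights=prevalences, k=n1)
        for i in range(K):
            totals[i] += draws.count(i) / n1
    return [t / n_trials for t in totals]


# Illustrative setting: four equally prevalent partitions.
prevalences = [0.25, 0.25, 0.25, 0.25]
avg_props = mean_stage1_proportions(prevalences, n1=200, n_trials=2000)
print(avg_props)
```

The averaged proportions should lie close to the true prevalences, with the remaining discrepancy attributable to Monte Carlo error.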
Conditional on continuing to stage 2, we have derived estimators for the effect in the selected subpopulation. Continuing to stage 2 is necessary for the unbiased estimators. This is not the case for the other estimators as they involve obtaining stage 1 estimates in all partitions that correct for the subpopulation selection and then combine them with the stage 2 unbiased estimates. Hence, estimates for effects in the dropped partitions that correct for subpopulation selection can be obtained using the shrinkage and bias-adjusted estimators. However, they are not necessarily mean unbiased.
Methods developed for normally distributed data following treatment selection have been adapted for time-to-event data. 28 Even after assuming asymptotic normality of the log hazard ratio, some of the estimators we have derived such as the UMVCUE may not be valid for time-to-event data. For example, if there is a quantitative interaction with hazard ratios in different partitions being unequal, a model that accounts for this is required. In this case, obtaining separate estimates for each partition is the valid approach.
Finally, since in all simulations the combined stages 1 and 2 sample size was 800, for the different stage 1 sample sizes considered, there would be no savings or losses in terms of the cost of treating patients. Savings or losses arise only in the costs associated with biomarker testing. Hence, the case for performing subpopulation selection with a small proportion of patients can be justified if the biomarker is expensive, leading to savings if F is selected. The case for performing subpopulation selection with a large proportion of patients is justifiable if the biomarker is not expensive. In this case, the resource loss is not substantial if F is selected and yet, if only a part of the population will benefit, there is a higher probability of making the right decision, which may improve power. The setting of fixed total sample size is sometimes referred to as enrichment because if some partitions are dropped in stage 2, the number of patients recruited from the remaining partitions in stage 2 is higher than if more partitions were selected. To save money on treatment costs or reduce the total sample size, subpopulation selection could be performed early, with no enrichment in stage 2. With no enrichment,