Point and interval estimation in two-stage adaptive designs with time to event data and biomarker-driven subpopulation selection

In personalized medicine, it is often desired to determine if all patients or only a subset of them benefit from a treatment. We consider estimation in two-stage adaptive designs that in stage 1 recruit patients from the full population. In stage 2, patient recruitment is restricted to the part of the population, which, based on stage 1 data, benefits from the experimental treatment. Existing estimators, which adjust for using stage 1 data for selecting the part of the population from which stage 2 patients are recruited, as well as for the confirmatory analysis after stage 2, do not consider time to event patient outcomes. In this work, for time to event data, we have derived a new asymptotically unbiased estimator for the log hazard ratio and a new interval estimator with good coverage probabilities and probabilities that the upper bounds are below the true values. The estimators are appropriate for several selection rules that are based on a single or multiple biomarkers, which can be categorical or continuous.


More selection rules and details for the expressions of the bounds
The expressions for the bounds are a function of the selection rule, selected partitions and stage 1 estimates. However, since for a trial a single subset S is selected, for each j ∈ S, we simplified notation for the bounds to be l j and w j rather than, for example, l j,S and w j,S . As examples, we describe the expressions for l j and w j for the selection rules in Section 2.3 in the main paper. For K = 2, the points corresponding to the expressions for l j and w j are illustrated in Figure  S1. While estimating θ 1 , the lower and upper edges of the vertical dashed and dotted lines that go through the stage 1 estimates correspond to l 1 (lower bound forθ 1,1 ) and w 1 (upper bound for θ 1,1 ), respectively. For estimating θ 2 , the right hand and left hand edges of the horizontal lines that go through the stage 1 estimates correspond to l 2 (lower bound forθ 1,2 ) and w 2 (upper bound for θ 1,2 ), respectively. Consequently from Figure S1(a), for K = 2, for the adaptive threshold enrichment design, when only partition 1 is selected, l 1 = (b − p 2θ1,2 )/p 1 and w 1 = b, while when both partitions are selected l 1 = −∞, w 1 = (b − p 2θ1,2 )/p 1 , l 2 = −∞ and w 2 = (b − p 1θ1,1 )/p 2 . Let p j = ∑ j i=1 p i ( j = 1, ..., K). For the adaptive threshold enrichment design, we show in the next section that for any K ≥ 2, when a subpopulation consisting of s (s = 1, ..., K) partitions is selected, for each j ∈ {1, ..., s} = S, ,i is set to zero when s = 1) and with l j set to be −∞ if all partitions are selected. From Figure S1(b), for the second selection rule in Section 2.3 in the main paper where all partitions whose stage 1 estimates are ≤ b are selected, for any S ⊆ {1, ..., K}, for each j ∈ S, l j = −∞ and w j = b.
Kimani et al. [7] consider the case of two partitions, that is K = 2, and where the trial always continues to stage 2. Their selection rule assumes a monotonic relationship between the biomarker values and the treatment effect so that the trial continues to stage 2 with partition 1 only or F. When the selection rule is modified so that a smaller estimate is desired, partition 1 is selected ifθ 1,1 < θ 1,2 − m/p 2 , where m is a pre-specified positive number. Otherwise the trial continues with F. The decision regions when p 1 = p 2 are shown in Figure S1(c). For the case of selecting partition 1, l 1 = −∞ and w 1 =θ 1,2 − m/p 2 . For the case of selecting F, l 1 =θ 1,2 − m/p 2 , w 1 = ∞, l 2 = −∞ and w 2 =θ 1,1 + m/p 2 .
Kunzmann et al. [10] selection rule replace m/p 2 in the Kimani et al. selection rule with a function d α (·) that encompasses a class of selection rules such as using stage 1 p-value(s) for selection, with the Kimani et al. rule being a special case. Consequently, for Kunzmann et al. selection rule, we replace m/p 2 with d α (·) in l 1 , l 2 , w 1 and w 2 defined for Kimani et al..

The bounds for the adaptive threshold enrichment design
The case of selecting the full population (K partitions) The full population (F) is selected if p 1θ1,1 + p 2θ1,2 + · · · + p Kθ1,K ≤ b which implies that for j ( j = 1, ..., K)θ When F is selected, there is no lower bound for p 1θ1,1 + p 2θ1,2 + · · · + p Kθ1,K . Therefore, when F is selected, l j <θ 1, j ≤ w j ( j = 1, ..., K), where l j = −∞ and The case of selecting the subpopulation consisting of K − 1 partitions The subpopulation consisting of K − 1 partitions is selected if the following two conditions are satisfied • p 1θ1,1 + p 2θ1,2 + · · · + p Kθ1,K > b.
• p 1θ1,1 +p 2θ1,2 +···+p K−1θ1,K−1 For a selected partition j ( j = 1, ..., K − 1), the first condition implies that and the second condition implies that Hence, when K − 1 partititions are selected, l j <θ 1, j ≤ w j ( j = 1, ..., K − 1), where The case of selecting the subpopulation consisting of K − 2 partitions The subpopulation consisting of K − 2 partitions is selected if the following three conditions are satisfied For a selected partition j ( j = 1, ..., K − 2), the first condition implies that the second condition implies thatθ and the third condition implies thatθ From the three conditions, when K − 2 partitions are selected, l j <θ 1, j ≤ w j ( j = 1, ..., K − 2), where The case of selecting the subpopulation consisting of s partitions . In general, following the above description of how to obtain bounds for the cases of F, K − 1 partitions and K − 2 partitions, it can be seen that when the subpopulation consisting of s (s ∈ {1, ..., K}) partitions is selected, l j <θ 1, j ≤ w j ( j = 1, ..., s) where 5 Worked example 5.1 Format of the data Table S1 shows the template of how the data may be formatted. We created a dataset with variables "Group" (1 for C-MTX and 0 for HDMTX), "Partition" (1 for high risk patients and 2 for low and intermediate risk patients) and "Stage" (1 and 2 for stages 1 and 2 patients, respectively). The dataset also includes the variables "SurvTime1" and "Status1" which are the survival time and censoring status (1 for DFS event and 0 for no DFS event) for stage 1 patients 2176 days after the first enrolment (t 1 is the date corresponding to this day). The other key variables are "SurvTime2" and "Status2" which are the survival time and censoring status (1 for DFS event and 0 for no DFS event) 3455 days after the first enrolment for stage 1 patients (t 1 is the date corresponding to this day) and 3644 days after the first enrolment for stage 2 patients (t 2 is the date corresponding to this day).
(c) Combined evidence (stages 1 and 2) Figure S2: Hypothesis testing based on closure principle and combination of p-values. The weights w 1 = √ 0.75 and w 2 = √ 0.25. The alternative hypothesis is that the log hazard ratios are greater than 0. These are required for determining the lower limits of the simultaneous duality confidence intervals.

R code for computing estimates for data that includes a bootstrap sample
In this section, we describe the R code used to obtain the various estimates for the example data combined with a boostrap sample of the same size as the example data. The estimates are summarised in Table S2. The R Code to obtain the point estimates and the naive confidence intervals can be deduced from the R code in Section 5.2.
The estimates required to obtain the duality confidence intervals are summarised in Figure S4. The p-values required to compute the lower bounds are not presented but can be computed using estimates in the figure. Based on them, it can not be concluded that the log hazard ratios in each of Partitions 1 and 2 are greater than 0 and so the lower bounds are obtained using the same R code as 5.2.7 (v) above, which we do not repeat. We give the R code for obtaining the upper bounds below.

Assessing the impact of a slower recruitment rate
To assess the impact of a slower patient recruitment so that the trial is longer, for γ = 0.5, we performed simulations where stages 1 and 2 are each approximately 2 years. In this case, the required sample is smaller and so we assume up to 1700 patients can be recuited uniformly over 4 years., with t 1 corresponding to 630 days after the interim analysis. The average additional follow-up events from stage 1 censored observations in the selected partitions was approximately 40. The selection probabilities (not reported) are similar to those in the main paper Table 2, which is the case of a quicker recruitment rate. The simulation results for point estimates are given in Table S8 and Figure S7. Generally, the results are similar to those in the main paper (Table 4 and Figure 4) and hence UMVCUE witht 1 > t 1 remain the best point estimator for a longer trial.

The impact of the number of events from stage 1 patients without events at the interim analysis
To assess the impact of the number of the number of events from stage 1 censored observations, for γ = 1 (Exponential distribution) and γ = 1.5, we peformed simulations witht 1 corresponding to 250 days and 150 days, respectively, after the interim analysis. The average number of additional deaths in the selected partitions was approximately 80, compared to approximately 40 in the other scenarios. The properties of the point estimators are summarised in Tables S10 and S12, and Figures  S8 and S9. As expected, the naive estimator witht 1 > t 1 (θ N j ) has smaller biases and RMSEs for the cases with more events from stage 1 patients without events at the interim analysis (Tables S10 and S12, and Figure S8 and S9) than the case of fewer events for stage 1 patients without events at the interim analysis (Table S4 and Figure S5 for γ = 1.0, and Table S6 and Figure S6 for γ = 1.5). Also, the UMVCUE (θ U j ) witht 1 > t 1 exhibits less bias and RMSE for the former and consequently much better properties than the UMVCUE (θ U j ) witht 1 = t 1 when there are more events from stage 1 patients without events at the interim analysis. Hence, we recommend the UMVCUEθ U j evaluated att 1 > t 1 .

Trials with fewer events
To assess the impact of the total number of events in a trial, for γ = 0.5, we performed simulations for the case of 200 events in each of stages 1 and 2. We used the same recruitment rate as the scenario considered for the results of Table 4 and Figure 4 in the main paper. The average number of the extra follow-up events for censored stage 1 observations in the selected partitions is 30. The biases of the naive point estimatorsθ N j (Table S14 and Figure S10) tend to be higher than the comparable results in the main paper (Tables 4 and Figure 4). This is attributable to the higher variability of stage 1 estimates and hence lower probabilities of making the ideal decisions as well as the smaller size of the unbiased stage 2 data. However, the UMVCUEθ U j fort 1 > t 1 is still the preferred point estimator.

Performing subpopulation selection earlier in the trial
To assess the impact of conducting subpopulation selection earlier in the trial, we performed simulations with the interim analysis done after 200 events and the trial stops after 400 events from patients recruited in stage 2 (in most of the previous results it is 300 events in each stage), witht 1 corresponding to 320 days after the interim analysis. The average number of events from stage 1 patients without events at the interim analysis in the selected partitions was approximately 40. The results of the point estimators are given in Table S16 and Figure S11. Compared to the case of 300 events in each stage (Table 4 and Figure 4 in the main paper), the biases of the naive estimtors are smaller which is attributable to more weight for the unbiased stage 2 data. As in the other scenarios, havingt 1 > t 1 and using the UMVCUEθ U j provides the best point estimates.    9 Weibull distribution (γ = 1.5) simulation results Table S6: Simulated biases and mean squared errors of the estimators for the log hazard ratios (Weibull distribution, γ = 1.5)

Simulated bias
Root mean squared error Selectedθ N jθ U jθ N jθ U j partitions (S) Partitiont 1 > t 1t1 = t 1t1 > t 1t1 = t 1t1 > t 1t1 = t 1t1 > t 1t1 = t 1 True log hazard ratios: θ 1 = θ 2 = θ 3 = θ 4 = 0.0198   10 Slower recruitment rate with Weibull (γ = 0.5) distribution simulation results Table S8: Simulated biases and mean squared errors of the estimators for the log hazard ratios with the Weibull (γ = 0.5) distribution when the trial recruits for approximately 2 years in each stage (slower recruitment rate) Simulated bias Root mean squared error   11 Exponential (γ = 1.0) distribution results witht 1 corresponding to 250 days after the interim analysis simulation results Table S10: Simulated biases and mean squared errors of the estimators for the log hazard ratios (Exponential distribution) witht 1 corresponding to 250 days after the interim analysis) Simulated bias Root mean squared error  Figure S8: Boxplots for estimates in partition 1 (panels a, b and d) and partion 2 (panel c) when the full population is selected to continue to stage 2 for exponential distribution (γ = 1.0) andt 1 corresponds to 250 days after the interim analysis. The horizontal dashed and dotted line in each plot corresponds to the true log hazard ratio of the partition considered in that plot while the true log hazard ratios in all partitions are given below the plot.  12 Weibull (γ = 1.5) distribution results witht 1 corresponding to 150 days after the interim analysis simulation results Table S12: Simulated biases and mean squared errors of the estimators for the log hazard ratios (Weibull (γ = 1.5) distribution) witht 1 corresponding to 150 days after the interim analysis) Simulated bias Root mean squared error Selectedθ N jθ U jθ N jθ U j partitions (S) Partitiont 1 > t 1t1 = t 1t1 > t 1t1 = t 1t1 > t 1t1 = t 1t1 > t 1t1 = t 1 True log hazard ratios: θ 1 = θ 2 = θ 3 = θ 4 = 0.0198  Figure S9: Boxplots for estimates in partition 1 (panels a, b and d) and partion 2 (panel c) when the full population is selected to continue to stage 2 for Weibull (γ = 1.5) distribution andt 1 corresponds to 150 days after the interim analysis. The horizontal dashed and dotted line in each plot corresponds to the true log hazard ratio of the partition considered in that plot while the true log hazard ratios in all partitions are given below the plot.  13 Simulation results when there are fewer events in a trial Table S14: Simulated biases and mean squared errors of the estimators for the log hazard ratios with the Weibull (γ = 0.5) distribution when there are fewer events (200 events at interim analysis, 200 events from stage 2 patients andt 1 corresponding to 300 days after interim analysis which approximately an average of 30 additional events in selected partitions from stage 1 patients without events at the interim analysis).

Simulated bias
Root mean squared error   14 Simulation results when subpopulation selection is performed earlier in the trial Table S16: Simulated biases and mean squared errors of the estimators for the log hazard ratios when subpopulation selection is performed earlier in the trial (200 events at interim analysis, 400 patients from stage 2 patients and an average of 40 additional events in selected partitions from the censored stage 1 observations) Simulated bias Root mean squared error   15 Different selection rule: selecting a partition independent of the results in the other partitions

Simulation results for a different selection rule
To assess the characteristics of the various estimators when a different selection rule is used, we performed simulations for the case of continuing with any partition whose stage 1 log hazard ratio estimate is ≤ 0. The other aspects of the simulations are the same as those used to obtain the results in Table 4 in the main paper, where γ = 0.5, the number of events in each stage is 300, up to 2200 patients can be recruited for up to two years and three configurations for  Tables S18 to S20, respectively. The biases of the naive point estimatorθ N j are positive in all cases. This is because a partition is selected if it has a positive effect. In several scenarios, biases are larger than in the case of the adaptive threshold enrichment design (Results in Section 5.2 in the main paper). Whent 1 > t 1 , the UMVCUEθ U j is slightly biased in some cases but has smaller RMSE than whent 1 = t 1 . Hence we recommend havingt 1 > t 1 and usingθ U j to obtain estimates. The simultaneous properties of the naive and the duality confidence intervals are summarised in Table S21.
For the two scenarios where the values for (θ 1 , θ 2 , θ 3 , θ 4 ) are (0.0198, 0.0198, 0.0198, 0.0198) and (−0.4055, −0.2231, −0.0953, 0) , unlike the naive confidence regions, the duality confidence regions have at least 95% coverage and the probabilities that at least one upper bound is less than the true value are less than 2.5%. For the other scenario of (θ 1 , θ 2 , θ 3 , θ 4 ) equal to (−0.2231, −0.0953, 0.3364, 0.4055) , the simulated probabilities (not reported in the table) for S equal to / 0, {1, 2}, {1} and {2} are 5.9%, 48.1%, 25.5% and 10.9%, respectively. In these cases which constitute more than 90%, the duality confidence regions have at least 95% coverage probability and the probabilities that at least one upper bound is less than the true value are less than 2.5%. For the remaining cases, the naive confidence intervals have undesirable properties because the coverage probabilities are as small as 88% and the probabilities that at least one upper bound is less than the true value are as high as 12%. The coverage probabilities for the simultaneous duality confidence intervals are generally at least the target 95% but the probabilities that at least one upper bound is less than the true value are mostly above 2.5%, although much smaller than those of the naive confidence intervals. We note that this is driven by the upper bounds for the treatment effects in Partitions 3 and 4. This may be considered to be of less practical impact since the log hazard ratios in these partitions are above 0 and the upper bounds are also mostly above 0 so that the new treatment would not be recommended in partitions 3 and 4. The reason that the duality upper bounds for the effects in partitions 3 and 4 do not show the desired properties is because the hypothesis testing described in Section 3.2 in the main paper does not control the type I error rate conditional on the selection made but controls the probability of selecting any partition where the treatment is not effective and concluding it is effective. When the treatment is effective in some partitions and not in others, conditional on the selection, the type I error rate is above the target 2.5%. Hence, since we assessed the properties of the simultaneous confidence intervals conditional on the selection made, tail probabilities for such scenarios can be above 2.5%.

Bigger treatment effects
To assess the properties of the estimators for the case of bigger treatment effects, we performed simulations when (θ 1 , θ 2 , θ 3 , θ 4 ) is (−1.0986, −0.9163, −0.6931, 0) , which corresponds to the hazard ratios vector (1/3, 4/10, 1/2, 1) . We used the selection rule used in Section 5.3 in the main paper, that is, selecting all partitions that have positive stage 1 estimated effects. Unlike in other simulations, because the probabilities of selecting some subpopulations such as selecting partitions 1 and 2 are small, we simulated 1,000,000 trials. In most simulated trials, all partitions (49.97%) or partitions 1 to 3 (49.85%) were selected to continue to stage 2. Note that this means that partitions 1 to 3 are selected with a high probability (close to 1) and consequently the selection biases in these partitions would be expected to be neglible. The simulation results are given in Table S22 and Figure  S12. Focusssing on the figure, which gives the Boxplots of the estimates in the different partitions when the full population is selected, for partitions 1 to 3, all estimators have approximately the same distribution which suggests neglible selection bias. For partition 3, all the estimators are practically mean unbiased. For partitions 1 and 2, all the estimators underestimate treatment effects with the underestimation bigger in partition 1 where the true log hazard ratio is largest. For partition 2 which corresponds to hazard ratio of 0.4, the bias is very small. We note that for partitions 1 and 2 most estimates are far from zero (point of no treatment effect) and so for an adequately powered trial, the underestimation is unlikely to lead to undesired clinical decision. Since the selection bias is negligible, and the biases increase with the true treatment effect, we attribute the underestimation to the asymptotic distribution of the score statistic being more accurate when the true log hazard ratio is small. In partition 4 where the control and the experimental treatment are equally effective, the naive estimators are positively biased but this is corrected adequately by the UMVCUE. So overall, the UMVCUE performs best in all partitions. Table S18: Simulated biases and mean squared errors of the estimators for the log hazard ratios for the Weibull (γ = 0.5) distribution when θ 1 = θ 2 = θ 3 = θ 4 = 0.0198 (Configuration 1) and a partition is selected independent of the results in other partitions.

Simulated bias
Root mean squared error  Table S19: Simulated biases and mean squared errors of the estimators for the log hazard ratios for the Weibull (γ = 0.5) distribution when θ 1 = −0.2231, θ 2 = −0.0953, θ 3 = 0.3365 and θ 4 = 0.4055 (Configuration 2) and a partition is selected independent of the results in other partitions.  Table S20: Simulated biases and mean squared errors of the estimators for the log hazard ratios for the Weibull (γ = 0.5) distribution when θ 1 = −0.4055, θ 2 = −0.2231, θ 3 = −0.0953 and θ 4 = 0 (Configuration 3) and a partition is selected independent of the results in other partitions.

Simulated bias
Root mean squared error Selectedθ N jθ U jθ N jθ U j partitions (S) Partitiont 1 > t 1t1 = t 1t1 > t 1t1 = t 1t1 > t 1t1 = t 1t1 > t 1t1 = t 1 17 Confidence intervals with bigger treatment effects ‡ ‡ † Type I error is the probability that at least one upper bound is less than the true value. * Results based on 130 simulated trials; ‡ Not selected.