Because of the complexity of cancer biology, the target pathway is often not well understood at the time that phase III trials are initiated. A 2-stage trial design was previously proposed for identifying a subgroup of interest in a learn stage, on the basis of 1 or more baseline biomarkers, and then confirming it in a confirm stage. In this article, we discuss some practical aspects of this type of design and describe an enhancement that can be built into the study randomization to increase the robustness of the evaluation. Furthermore, we show via simulation studies how the proportion of patients allocated to the learn versus the confirm stage affects power, and we provide recommendations.
In this era of precision medicine, advances in predictive and prognostic biomarker development in clinical oncology have been so rapid that it is now routine to incorporate novel biomarker plans in the hope of better identifying a target population. Regulatory agencies such as the Food and Drug Administration and the European Medicines Agency have encouraged effective use of biomarkers in cancer drug development by issuing scientific guidelines[1, 3] and initiating national efforts such as the AACR-FDA-NCI Cancer Biomarkers Collaborative.[3] Because of the inherent complexity of cancer biology, at the planning stage of most oncology phase III trials the target pathway is not well understood, and a predictive biomarker to identify sensitive subjects is unavailable. Trials are commonly conducted in an all-comer population, and while baseline tissue/blood samples are often collected, analyses of genomic or proteomic markers for use in patient selection are typically purely exploratory. In particular, exploratory post hoc subgroup identification is often performed, and such retrospective analyses cannot be used to seek regulatory approval. Freidlin and Simon[4] proposed an adaptive signature design (ASD) that prospectively combines the identification of sensitive subgroups with the test for overall treatment effect within a single randomized trial. The objective is to maximize the probability of a positive trial by testing not only the overall population treatment effect but also, prospectively, an important subgroup (if one exists). The method splits the trial into two stages: first, a “learn” stage in which, within a prespecified partition of the overall population, an assay or signature is sought that identifies a subset of patients who benefit, or benefit most, so that a “classifier” can be developed.
The second stage is then used to seek to “confirm” that the classifier does indeed identify a subset of patients who benefit, or benefit most, while simultaneously testing for the treatment effect in the overall population (using data from both learn and confirm stages). Simulations have shown that this 2-stage design will not compromise the power to detect an overall treatment effect.
The practical application of this design is potentially broad. For instance, suppose a phase III clinical trial in oncology is being conducted to confirm the safety and efficacy of a new potential drug. During the development of the new drug, extensive preclinical work, and some clinical work, has been conducted to try to identify genomic or proteomic markers that are predictive of a treatment effect. In view of the limited clinical data available from phase II studies, which, as is typical in oncology, enrolled on the order of 100 to 150 patients, there is serious interest in 3 to 10 different potential biomarkers: some binary, some ordered categorical, and some continuous. Currently, none of these potential biomarkers has sufficiently strong prior data or rationale to warrant either restricting enrollment to a subpopulation or allocating some of the overall significance level alpha (α, normally set at .05) to a predefined subgroup. Consequently, any analyses run at the end of the study will be exploratory and will need to be confirmed in a new study, unless we can build the ASD into the trial so that we can learn and confirm in the same study.
Furthermore, the practical relevance of this type of design is illustrated by its use in a real example: Garon et al[5] evaluated the efficacy and safety of programmed cell death 1 inhibition with pembrolizumab in patients with advanced non–small-cell lung cancer enrolled in a large phase I study. The sponsor sought to define and validate an expression level of programmed cell death 1 ligand 1 associated with the likelihood of clinical benefit and thus assigned 495 patients receiving pembrolizumab to either a learn set (182 patients) or a confirm set (313 patients).
In this article, we address 2 important considerations concerning the application of the ASD: first, how many patients should be allocated to the learn versus the confirm stage, and second, how patients should be allocated to each stage.
2 DESIGNS AND METHODS
Freidlin and Simon's novel design consists of 3 components in 2 stages. In the “learn” stage, an assay or signature is applied to a prespecified partition of the overall population to identify a subset of patients who benefit most from treatment, so that the “classifier” can be determined for subsequent use. A classifier could be a single binary biomarker (eg, presence or absence of a specific genetic mutation), a single ordered categorical or continuous biomarker with an associated cutoff value (eg, expression of a specific protein exceeding some level), or a rule involving multiple biomarkers, for instance, a linear combination with an associated cutoff, or a rule such as absence of 1 characteristic along with presence of another. In the “confirm” stage, 2 prespecified tests are performed: 1 test on all patients, and a second test on the patients not used in the learn stage who are classified as sensitive by the classifier developed in the learn stage (that is, the marker-positive subgroup, denoted the “M+ subgroup”). The former test is similar to the traditional “all-comer design,” but with a reduced α, while the latter uses a relatively stringent α. A study is considered successful if either of the 2 tests is significant.
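As a minimal sketch (in Python, with hypothetical names; not code from the original article), the final decision rule just described, using the α split recommended by Freidlin and Simon, can be written as:

```python
def asd_decision(p_all, p_mpos, alpha1=0.04, alpha2=0.01):
    """Two-stage ASD final decision: success if either prespecified test wins.

    p_all  -- P value for the overall (all-comers) treatment comparison
    p_mpos -- P value for the treatment comparison in the confirm-stage
              M+ subgroup
    Splitting alpha = alpha1 + alpha2 (Bonferroni-style) keeps the overall
    type I error rate at or below alpha.
    """
    return p_all < alpha1 or p_mpos < alpha2
```

Because the two rejection regions are combined with a Bonferroni-style split, the procedure controls the overall type I error rate at α1 + α2 regardless of the correlation between the two test statistics.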
The performance of the ASD was evaluated by simulation studies under different assumptions about some key factors. The allocation ratio of patients between the learn and confirm stages, which directly affects the power of the tests in the all-comers population and the M+ subgroup, needs to be prespecified to preserve trial integrity. Freidlin and Simon[4] acknowledged that many unknown parameters may influence the optimal ratio but nonetheless recommended equal allocation based on robust performance across different settings (see table 4A-D of Freidlin and Simon[4]). The literature on splitting datasets into training and test sets generally addresses identifying a classifier and minimizing the mean squared error of prediction, and optimality often depends on factors such as sample size and the number of predictors. One rule of thumb is to allocate 2/3 of the overall population to the training set and 1/3 to the test set in most circumstances.[6] However, 1 key constraint in the 2-stage design setting is that the number of patients not used in the learn stage needs to be large enough for the test for treatment effect in the confirm stage to reach statistical significance.
For a total of N patients to be enrolled in a study, let k be the proportion of patients in the learn stage. The number of patients in the confirm stage is, therefore, N(1 − k). The value of k should be determined with caution. On the one hand, if it is too small, the design is unlikely to be powered to identify the true marker(s) in the learn stage, and the decreased α for detecting an overall effect in the confirm stage would make the 2-stage design disadvantageous compared to the traditional design. On the other hand, if it is too large, the true treatment effect in the M+ subgroup may not be confirmed due to lack of power, and thus, α spent in the learn stage is wasted. Determination of the optimal allocation ratio between the 2 stages is one of the primary focuses of our current research.
Another important decision that Freidlin and Simon[4] mentioned without in-depth discussion is the split of the significance level α between the all-comers test and the M+ subgroup test. For a given significance level for testing the overall treatment effect (normally prespecified at α = 0.05), they recommended setting α1 = 0.04 (80% of α) for the overall population and α2 = 0.01 (20% of α) for the M+ subgroup. We investigated additional combinations via simulation.
Our simulation study is not intended to be exhaustive but rather to demonstrate that such a simulation study should be conducted prior to implementation in any phase III study, since the power of the study is dependent on this allocation, and to demonstrate that this can be chosen so as to maximize the power across plausible alternatives. We have developed an open-source R package “simASD” to facilitate each study-specific simulation, with more details in Section 5 below.
3 POWER EVALUATIONS
3.1 Virtual data generation specifications
We studied the ASD procedure in the context of there being a “sensitive” subgroup receiving a treatment benefit (denoted as being from the M+ population), with the remaining patients receiving either no benefit or modest detriment (denoted as being from the M− population). Across all simulation scenarios, we assumed that the proportion of the population falling into the M+ subgroup was 40% and, in each case, assumed hazard ratios (HRs) for the M+ and M− populations that give rise to an overall HR (ie, an HR in the overall mixed population) of 0.87. While it should be noted that a mixture of 2 populations, each with a constant HR between treatments, will not generally have a constant HR, the effect of this should not qualitatively alter our conclusions.
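The scenario HRs in Table 1 are consistent with this overall HR of 0.87 in the sense that the prevalence-weighted geometric mean of the subgroup HRs is about 0.87. The following Python check is our own rough first-order approximation (as noted above, the true mixture HR is not constant over time):

```python
import math

def approx_overall_hr(hr_pos, hr_neg, prev_pos=0.40):
    """Prevalence-weighted geometric mean of the subgroup hazard ratios:
    a rough first-order approximation to the overall HR of the mixture."""
    return math.exp(prev_pos * math.log(hr_pos)
                    + (1 - prev_pos) * math.log(hr_neg))

# Moderate, strong, and strongest scenarios (HR in M+, HR in M-)
for hr_pos, hr_neg in [(0.71, 1.00), (0.60, 1.11), (0.54, 1.20)]:
    print(round(approx_overall_hr(hr_pos, hr_neg), 2))  # each prints 0.87
```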
Table 1 provides virtual data generation parameter specifications. Three potential sample sizes for phase III oncology trials were used, with 1:1 randomization ratio for the 2 treatment arms (placebo and treatment). Three (x1 − x3) and 10 (x1 − x10) continuous uncorrelated biomarkers (values from the uniform distribution ranging from 0 to 1) were generated, and in both cases only x1 was the true predictive biomarker. The marker subgroups (M+ and M−) were defined by a step function with the cutoff value at 0.40 (ie, M+ if x1 ≤ 0.40, and M− if x1 > 0.40). Because of random variability in x1, the percentage of subjects in the M+ subgroup for a single virtual dataset was close to—but not exactly—40%.
Table 1. Virtual data generation parameter specifications
Parameters | Possible Values or Levels | Comments
Total sample size | 700, 1400, 2100 | Randomization ratio 1:1
Number of biomarkers | 3, 10 | Only x1 is the true biomarker
Predictive effect (“scenario”) | Moderate | HR in M+ = 0.71; HR in M− = 1.00
 | Strong | HR in M+ = 0.60; HR in M− = 1.11
 | Strongest | HR in M+ = 0.54; HR in M− = 1.20
The simulation setup (sample sizes, baseline survival function, and hazard ratios of interest) will be specific to the study design context being considered. Here, 3 predictive effects (“scenarios”) were considered: moderate, strong, and strongest. From the moderate to the strongest scenario, the HR of treatment over placebo in M+ decreases from 0.71 to 0.54, while the HR in M− increases from 1.00 to 1.20. In each of the 3 scenarios, we simulated 100 datasets and used a piecewise exponential baseline survival function (3 segments, with breakpoints at 193.33 and 350.67 days). The median survival times (in days) in the control arm were 439.64 (segment 1), 203.32 (segment 2), and 154.62 (segment 3). The Cox proportional hazards model[7] with an overall HR of 0.87 was assumed for the response. The maximum follow-up per subject was 547.5 days, and the censoring rate prior to maximum follow-up was 20%. The datasets and scenario characteristics were chosen such that with n = 1400, the all-comer analysis was moderately powered at α1 = 0.04 and the M+ analysis was highly powered in the moderate scenario at α2 = 0.01. With both the maximum follow-up per subject and censoring prior to maximum follow-up taken into account, about 70% of subjects experienced an event.
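Piecewise exponential control-arm survival times of the kind described above can be simulated by inverting the cumulative hazard. The following Python sketch is our own illustration (the function name `rpexp` echoes the analogous R function); the segment hazards are derived from the stated control-arm medians via λ = ln 2 / median:

```python
import math
import random

def rpexp(breaks, hazards, rng):
    """Draw one survival time from a piecewise exponential distribution by
    inverting the cumulative hazard. `breaks` gives the segment breakpoints;
    `hazards` gives one hazard rate per segment (one more hazard than breaks)."""
    target = -math.log(1.0 - rng.random())  # Exp(1) cumulative-hazard budget
    t, cum = 0.0, 0.0
    for lam, edge in zip(hazards, list(breaks) + [math.inf]):
        if cum + lam * (edge - t) >= target:
            return t + (target - cum) / lam  # budget exhausted in this segment
        cum += lam * (edge - t)
        t = edge

# Hazards implied by the control-arm segment medians (lambda = ln 2 / median)
medians = [439.64, 203.32, 154.62]
hazards = [math.log(2) / m for m in medians]
rng = random.Random(2024)
times = [rpexp([193.33, 350.67], hazards, rng) for _ in range(10000)]
```

Censoring at the 547.5-day maximum follow-up and the 20% earlier administrative censoring would then be applied on top of these event times.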
3.2 Test implementation details
The final analysis consists of (1) overall comparison of treatment arms in the all-comers population, using data from all randomized patients (ie, including both learn and confirm sets) conducted at a significance level α1 and (2) a comparison of treatment arms in the selected subset of sensitive patients in the confirm set, performed at significance level α2.
The marker used to select patients in the confirm set in (2) was determined using the learn set of patients. For a given learn/confirm subject allocation, we used the learn-stage subset and dichotomized each marker xi (i = 1, 2, 3 or i = 1, 2, …, 10) by a fixed quantile. For simplicity, we used 3 fixed quantiles (0.25, median, and 0.75) for the biomarker cutoffs (roughly corresponding to biomarker values of 0.25, 0.50, and 0.75, respectively, given the uniform distribution). A Cox proportional hazards model was fitted including the treatment, the marker xi, and their interaction (note: a single xi, rather than all markers, was used in each single-marker analysis). We consistently set α = 0.05 at the learn stage, and no multiplicity adjustment was made because this is an exploratory stage. If the marker-by-treatment interaction was significant, the marker was considered a potential predictive marker. Where multiple markers had significant interaction effects, the marker with the most significant interaction effect was chosen. To ensure that one and only one marker was selected in the learn stage, if all treatment-by-marker interactions were nonsignificant at α = 0.05, we still chose the marker with the smallest interaction P value. Therefore, at the end of the learn stage, a single marker was identified along with a cutoff value.
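The learn-stage selection rule (always return exactly one marker/cutoff pair, flagging whether its interaction reached the exploratory α = 0.05) can be sketched as follows. The Cox interaction fits themselves are assumed to have been run already, so this hypothetical function simply receives their P values:

```python
def select_marker(interaction_pvals):
    """Learn-stage selection rule: keep the (marker, cutoff) pair with the
    smallest treatment-by-marker interaction P value from the Cox fits.
    `interaction_pvals` maps (marker_name, cutoff_quantile) -> P value.
    Exactly one pair is always returned, even if nothing is significant at
    the exploratory alpha = 0.05; the flag records significance."""
    (marker, cutoff), p = min(interaction_pvals.items(), key=lambda kv: kv[1])
    return marker, cutoff, p < 0.05

# Hypothetical P values for 3 markers, each dichotomized at one quantile
pvals = {("x1", 0.25): 0.002, ("x1", 0.50): 0.010,
         ("x2", 0.50): 0.300, ("x3", 0.75): 0.800}
print(select_marker(pvals))  # ('x1', 0.25, True)
```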
At the confirm stage, we specified 2 α levels: α1 for all-comers test and α2 for the M+ subgroup test, where α = α1 + α2 = 0.05. This ensured control of the overall α level for the procedure. Within the confirm-stage subjects, we used the marker and cutoff selected at the learn stage to identify the M+ subgroup. We then performed 2 tests for treatment effect: (1) for all-comers (data from both stages combined) at α1 and (2) for M+ subgroup subjects at α2. The all-comer test was identical to the traditional 1-stage design, except for a reduced significance level (α1 < α). In the M+ test, no marker was involved in the Cox model as a covariate, and no further alpha adjustment was made as only a single subgroup was considered. The study was considered positive if either of the 2 tests was significant.
For each combination of predictive effect, total sample size, and number of biomarkers, we considered 5 allocation percentages for the learn/confirm stages (learn set allocation P increased from 30% to 70% with an increment of 10%), and 4 α allocation splits for the all-comers and M+ subgroup in the confirm stage (α1 allocation for the all-comers analysis increased from 0.025 to 0.04 with an increment of 0.005, while keeping the overall significance level α = α1 + α2 fixed at 0.05). Table 2 provides a summary of test implementation specifications.
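For concreteness, the full simulation grid implied by Tables 1 and 2 (3 scenarios × 3 sample sizes × 2 biomarker counts × 5 learn/confirm allocations × 4 α splits × 3 cutoff quantiles) can be enumerated as follows; this is a sketch with our own variable names:

```python
from itertools import product

scenarios    = ["moderate", "strong", "strongest"]
sample_sizes = [700, 1400, 2100]
n_markers    = [3, 10]
learn_pcts   = [0.30, 0.40, 0.50, 0.60, 0.70]        # learn-stage share
alpha_splits = [(0.025, 0.025), (0.030, 0.020),
                (0.035, 0.015), (0.040, 0.010)]      # (alpha1, alpha2)
cutoffs      = [0.25, 0.50, 0.75]                    # dichotomizing quantiles

# Every split preserves the overall significance level alpha = 0.05
assert all(abs(a1 + a2 - 0.05) < 1e-12 for a1, a2 in alpha_splits)

grid = list(product(scenarios, sample_sizes, n_markers,
                    learn_pcts, alpha_splits, cutoffs))
print(len(grid))  # 1080 design cells, each evaluated over simulated datasets
```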
Table 2. Test implementation specifications
Parameters | Possible Values or Levels | Comments
Learn/confirm allocation, % | 30/70, 40/60, 50/50, 60/40, 70/30 |
α allocation, all-comer/M+ | 0.025/0.025, 0.03/0.02, 0.035/0.015, 0.04/0.01 | Allocation at confirm stage
Biomarker cutoff values | 0.25 quantile, median, 0.75 quantile of xi | M+ = 1 if xi < cutoff; 0 otherwise
3.3 Simulation results
Simulation results for the 3 total sample sizes of 700, 1400, and 2100 are shown in Figures 1, 2, and 3, respectively. The empirical powers (y axis) are plotted against the learn/confirm sample size allocations (x axis) for the 3 scenarios (by color or shape) and the 4 α allocations (by line type). In each figure, the left and right columns show results for 3 and 10 biomarkers, respectively, and each row corresponds to one fixed biomarker cutoff value. For direct comparison, we also superimposed 3 horizontal dashed lines indicating the powers of the traditional 1-stage design (with no embedded subgroup analysis) under the 3 scenarios. Detailed numerical values for all simulations are provided in tabular format in the Supporting Information and are available online.
Figure 3. Relationship of empirical power and learn/confirm sample size allocations under 3 scenarios (sample size = 2100)
3.3.1 Impact of learn/confirm allocation
As the learn/confirm allocation changes from 30/70 to 70/30 (%), a general decreasing trend in empirical power is seen when the biomarker cutoff value (first quartile, around 0.25, or median, around 0.50) is close to the truth (0.40) and when the number of biomarkers is small (p = 3). This decreasing trend is also generally consistent across the different α allocations. With an increased number of biomarkers (p = 10), the same decreasing trend holds, but the powers are reduced. This pattern, however, becomes blurred when the biomarker cutoff value (third quartile, around 0.75) is far from the truth (0.40): the powers are then similar regardless of the learn/confirm allocation, though 70/30 (%) seems to perform worst. In sum, with a sample size of 1400 in a large phase III oncology trial (ie, 700 in each arm), a relatively accurate estimate of the biomarker cutoff, and a restricted number of biomarkers, a learn/confirm allocation of 30/70 or 40/60 (%) in the 2-stage design offers the greatest power advantage irrespective of the strength of the predictive effect.
3.3.2 Impact of different splits of the α
When the sample size equals 700 or 1400, the highest power is generally observed under the “even split” of α (ie, 0.025/0.025) in the strongest scenario. However, the even split of α often results in the lowest power in the moderate and strong scenarios. There is no evident pattern that a particular α split dominates across all situations. In some cases, especially at the largest sample size, the power difference is negligible across the 4 α allocations (eg, n = 2100, median cutoff), while in other cases the power difference is evident (eg, n = 700, median cutoff). In general, the difference becomes smaller as sample size increases.
3.3.3 Impact of the number of biomarkers considered in the learn stage
Not surprisingly, power decreases as the number of biomarkers increases (owing to an increased chance of falsely identifying a nonpredictive biomarker). However, changing the learn/confirm allocation has a similar relative effect for 10 biomarkers as for 3. Take the strongest scenario, for example (sample size = 700, median cutoff, all-comer α = .025): the power drops by about 35% in both cases, from 0.59 to 0.37 when p = 3 and from 0.50 to 0.33 when p = 10. When the biomarker has a strong or strongest predictive effect, such trends are generally consistent across all sample sizes. Even with an increased number of biomarkers, the power gain of the 2-stage design over the 1-stage design is evident when the predictive effect of the biomarker is strong or strongest and the cutoff used is not too far from the truth (ie, first quartile or median). This is illustrated by the 2-stage design power curves lying consistently above the corresponding horizontal dashed lines representing the 1-stage design powers.
3.3.4 Comparison of the 2- and 1-stage design
In the strongest scenario, 2-stage powers dominate 1-stage powers in almost all situations; the exceptions are a few cases where the learn/confirm allocation is 70/30 (%). When the median is used as the biomarker cutoff and optimal values of the other parameters are selected, the power increase of the 2-stage design over the 1-stage design is substantial: 0.59 vs 0.21 when n = 700, 0.89 vs 0.45 when n = 1400, and 0.98 vs 0.68 when n = 2100. In contrast, in the moderate scenario the 2-stage design yields very limited power gain, or even a power loss. The same often occurs in the strong scenario if the biomarker cutoff is poorly determined (ie, third quartile). We included this “deviated” cutoff value in our simulations to demonstrate the negative impact of a poorly selected biomarker on the 2-stage design: the learn/confirm allocation then becomes irrelevant, as a change in the learn-stage sample size has an almost negligible effect on the subsequent confirm stage owing to poor subgroup identification; the power loss results from the all-comer test being conducted at a reduced significance level and from sensitive patients in the confirm stage being incorrectly selected. We provide more details on identifying sensitive subpopulations in Section 5 below.
4 ALLOCATION OF PATIENTS
Firstly, patients cannot be allocated in a way that introduces systematic differences between the learn and confirm sets, so the allocation must be random and not associated with, for instance, any time trend or geographical factor. The allocation should also be implemented and documented in such a way as to avoid any possibility of bias. A simple randomization of patients at the start of the trial is thus an option. However, we also want both the learn and confirm sets to be internally balanced with respect to treatment arm and important prognostic factors. Furthermore, we want the learn and confirm sets themselves to be similar with respect to the prognostic factors.
Consequently, we propose a 2-stage randomization. Patients are initially randomized to a treatment and stratified by key prognostic factors as per usual. A subsequent randomization then takes place, in which patients are allocated to learn or confirm set, with randomization once again stratified by the same key prognostic factors and additionally by treatment. The randomization scheme, along with the key aspects of the 2-stage design discussed in this paper, should be described prospectively in the study protocol and/or the statistical analysis plan to ensure data integrity and avoid any ambiguity in implementation after the database lock.
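A minimal sketch of this 2-stage stratified randomization follows (Python; the names are ours, permuted blocks of 4 are an illustrative choice, and an equal 1:1 learn/confirm split is used for simplicity; an unequal split such as 30/70 would use correspondingly weighted blocks):

```python
import random

def permuted_block_assign(n, labels, block_size, rng):
    """Assign n subjects to `labels` using permuted blocks, so counts stay
    balanced within every block. For an unequal split (eg, 30/70), pass
    weighted labels such as ['learn'] * 3 + ['confirm'] * 7 instead."""
    out = []
    while len(out) < n:
        block = labels * (block_size // len(labels))
        rng.shuffle(block)
        out.extend(block)
    return out[:n]

def two_stage_randomize(patients, strata_key, rng):
    """Sketch of the proposed scheme: (1) randomize to treatment within
    strata defined by key prognostic factors; (2) randomize to learn/confirm
    within strata-by-treatment cells, so both sets are internally balanced
    and similar to each other."""
    # Stage 1: treatment arm, stratified by prognostic factors
    by_stratum = {}
    for p in patients:
        by_stratum.setdefault(strata_key(p), []).append(p)
    for group in by_stratum.values():
        arms = permuted_block_assign(len(group), ["placebo", "treatment"], 4, rng)
        for p, arm in zip(group, arms):
            p["arm"] = arm
    # Stage 2: learn/confirm set, stratified by the same factors plus arm
    by_cell = {}
    for p in patients:
        by_cell.setdefault((strata_key(p), p["arm"]), []).append(p)
    for cell in by_cell.values():
        sets_ = permuted_block_assign(len(cell), ["learn", "confirm"], 4, rng)
        for p, s in zip(cell, sets_):
            p["set"] = s
    return patients

# Example: 80 patients in 2 strata, fully balanced after both stages
patients = [{"id": i, "stratum": i % 2} for i in range(80)]
two_stage_randomize(patients, lambda p: p["stratum"], random.Random(7))
```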
5 DISCUSSION
In this simulation study, we evaluated the impact of the following factors: (1) allocation between learn and confirm stages; (2) alpha allocation for the all-comers test and the M+ subgroup test; and (3) the number of candidate biomarkers considered in the learn stage. There are, however, other considerations that may influence the results. For example, the impact of multiplicity control at the learn stage (eg, strong vs weak control); leveraging advanced strategies, such as graphical testing by Bretz et al[8] to allow α-propagation in a more sophisticated manner; the choice of biomarker/subgroup identification methods at the learn stage (eg, advanced tools such as the novel recursive partitioning procedure SIDES[9] and the tree-based method GUIDE[10]) to more effectively identify useful and relevant subgroups (with key considerations of multiplicity adjusted P values and bias-corrected estimates of effect sizes); correlations among biomarkers; whether multiple markers are allowed to enter into the confirm stage and the associated multiplicity issues (we only allowed for a single marker); and so on. These factors may be investigated in future studies.
Even though we used a fixed cutoff in the simulations, identifying the sensitive subpopulation need not be a hurdle when implementing the 2-stage design. First, as discussed in Beckman, Clark, and Chen,[11] support for a predictive biomarker hypothesis can be obtained by numerous approaches. Preclinical experiments (in vitro and in vivo) with an exploratory phase of biomarker development and/or early-phase clinical development (eg, phase II proof of concept) can provide insights into the assay/signature for the most likely responders to a new agent in a larger phase III trial. Investigators should therefore have a reasonably good estimate of the cutoff value to be confirmed. Second, for investigational oncological agents for the same indication targeting the same biological pathway, it is sensible to use existing study results, especially those with successful retrospective biomarker identifications and biological rationales. One example is the phase III First-Line ErbituX in lung cancer (FLEX) study (Pirker et al[12]), in which high-level expression of the epidermal growth factor receptor protein, defined by a tumor immunohistochemistry H-score ≥200, was a biomarker predicting overall survival benefit from adding cetuximab to first-line chemotherapy in patients with advanced non-small cell lung cancer (NSCLC). The same cutoff value was prospectively specified in the SQUIRE trial (Thatcher et al[13]). Third, “data-driven” methods for identifying a cutoff are abundant, such as fitting a generalized linear model[4] or using a variety of machine learning techniques (Breiman[14]).
Appropriate 2-stage randomization algorithms should be used to ensure comparability and balance both between and within the 2 patient subgroups used for learning and confirming (see Section 4). The randomization process must be prespecified and well documented.
The literature is rich in adaptive designs that incorporate both overall and subgroup tests, and we highlight two of them here. Freidlin, Jiang, and Simon[15] proposed an extension of the ASD, the cross-validated ASD, which is generally more powerful. In the cross-validated ASD, the full study sample is randomly divided into K equally sized, mutually exclusive sets of patients, and a “sensitive subset” is identified within each set using a “signature” developed from the remaining patients. The sensitive subsets are then pooled, and the treatment effect is tested in this pooled subset. Jenkins, Stone, and Jennison[16] is another important work, particularly for oncology studies, as their proposed adaptive seamless design allows investigation of a co-primary population based on an intermediate time-to-event endpoint that differs from, but is correlated with, the final endpoint. This design is flexible and efficient: the decision rule applied at stage 1 on the basis of progression-free survival (PFS) can lead to studying the co-primary population, the full population, or the subgroup only at stage 2, and stopping the trial for futility is another possible outcome. The final decision based on overall survival (OS), given that the trial continues to stage 2, is made by including all patients with appropriate integration of data from both stages. When a predictive biomarker has already been identified, this design offers potential power gains.
We have studied the ASD because it has an important practical advantage over other methods: at the end of the learn stage, certain practical steps can be taken by the sponsor, including assay refinement, regulatory meetings, or identification of a partner to develop a diagnostic kit. This would enable the sponsor to develop a fully validated assay and test the final samples using a market-ready version of the diagnostic assay, with an analysis plan that has gained regulatory approval. Furthermore, to ensure integrity in a regulatory framework, it would enable a well-documented implementation plan to be developed, which would include confirming that the assay samples for the confirmatory set were not only unavailable to the sponsor but had not even been assayed prior to locking down the statistical analysis plan for the confirmatory set. It should be noted that every adaptive design comes with pros and cons, not only for statistical considerations, such as control of the type I error rate and power, but also for practical considerations, such as operational complexity and cost effectiveness. As we have discussed for the 2-stage ASD, choosing the most appropriate phase III design strategy demands a comprehensive understanding of each method, with the necessary simulation studies performed in advance, appropriate documentation of the entire trial flow prepared, and potential regulatory risks and challenges evaluated.
To better facilitate implementing each “study-specific” simulation based on existing information, we developed an accompanying open-source R package called simASD to create plots like Figures 1 to 3 in the manuscript. End users can easily modify the programs to fit their particular simulation needs. A vignette with detailed descriptions of preparing a simulated mega dataset and 2 simple function calls for the power evaluations is available as part of the package.
6 CONCLUSIONS
In cancer drug development, there may often be a set of biomarkers that modify the effect of the drug and may even be predictive of which patients do and do not benefit from treatment. These biomarkers can be based on prior preclinical models, biological understanding of the mechanism, target, and pathways, and early clinical data (despite the generally limited sample sizes in phase I/II studies). There is often uncertainty surrounding the identification of predictive biomarkers and their cutoffs (if continuous), and therefore the evidence to support a stratified or restricted-population phase III trial is limited. Consequently, analyses of potentially predictive biomarkers tend to be exploratory, especially if the treatment effect is not significant in the intent-to-treat population. If these efforts identify a promising biomarker or subgroup, investigators have to conduct new phase II/III studies to seek an approved label claim.
From an efficiency perspective, it is of great interest to use a novel 2-stage design that prospectively combines the identification of potential subgroups with the test for overall treatment effect within a single clinical trial. The ASD originally proposed by Freidlin and Simon[4] is one such effort.
A deeper understanding of the design and operating characteristics of learn-confirm designs enables a trial sponsor to evaluate thoroughly, at the trial planning stage, the potential benefits of using a 2-stage design, so that the design decision can be made with greater confidence. In this article, we performed extensive simulations under realistic assumptions to evaluate key aspects such as the optimal sample size allocation between the 2 stages. The results show how such simulations can be used to guide future study designs that aim to test not only the overall population but also a sensitive subgroup within a single randomized trial.
ACKNOWLEDGEMENTS
The author would like to thank Jonathan Denne and Hollins Showalter for critical reviews of the manuscript and Adarsh Joshi and Eric Nantz for helpful discussions.