An alternative method to analyse the biomarker‐strategy design

Recent developments in genomics and proteomics enable the discovery of biomarkers that allow identification of subgroups of patients responding well to a treatment. One currently used clinical trial design incorporating a predictive biomarker is the so‐called biomarker strategy design (or marker‐based strategy design). Conventionally, the results from this design are analysed by comparing the mean of the biomarker‐led arm with the mean of the randomised arm. Several problems regarding the analysis of the data obtained from this design have been identified in the literature. In this paper, we show how these problems can be resolved if the sample sizes in the subgroups fulfil the specified orthogonality condition. We also propose a different analysis strategy that allows definition of test statistics for the biomarker‐by‐treatment interaction effect as well as for the classical treatment effect and the biomarker effect. We derive equations for the sample size calculation for the case of perfect and imperfect biomarker assays. We also show that the often used 1:1 randomisation does not necessarily lead to the smallest sample size. In addition, we provide point estimators and confidence intervals for the treatment effects in the subgroups. Application of our method is illustrated using a real data example.


INTRODUCTION
The focus of modern medicine has shifted from broad spectrum treatments to targeted therapeutics leading to new challenges for the design and analysis of clinical trials. Recent developments in genomics and proteomics enable the discovery of biomarkers that allow identification of subgroups of patients responding well to a treatment. Often, little is known about this subset of patients until well into large-scale clinical trials. 1 The impact of genomic variability is often assessed in a biomarker-based design. The general question when using any of these designs is whether different treatments should be recommended for different subgroups of patients. Establishing clinical relevance of a biomarker test for guiding therapy decisions requires demonstrating that it can classify patients into distinct subgroups with different recommended managements. 2 Several clinical trial designs using a biomarker to identify subgroups of patients likely to respond to a treatment have been proposed in the literature. The most commonly used designs are the enrichment design and the biomarker stratified design. 3,4 Another currently used design incorporating a predictive biomarker is the so-called biomarker strategy design (or marker-based strategy design). 2,5,6 Within this design, patients are randomised either to have their treatment based on the biomarker (ie, biomarker-positive patients receive a new treatment while biomarker-negative patients receive the standard treatment) or to be randomly assigned to treatment T or control group C (see Figure 1).

FIGURE 1 Biomarker-strategy/marker-based design
An alternative variant of the biomarker strategy design would not randomise patients in the randomised arm but treat everyone in this arm with the control treatment. 2,3,7 Although this design is a special case of the design described above (the probability to be randomised into the treatment arm is set to 0), this special case is more frequently used. 8 The main criticism here is that a huge proportion of the patients receive the same treatment. 9 Furthermore, given that all patients in the randomised arm receive the control treatment, this variant of the biomarker-strategy design cannot establish whether the new treatment would be beneficial for biomarker-negative patients as they all receive the control treatment. 3 Conventionally, the results of a biomarker-strategy design are analysed comparing the biomarker-led arm (arm BM) to the randomised arm (arm R) (see Figure 1). The efficiency of this analysis has been investigated. 8,10 However, all criticisms are focusing on special cases of the biomarker-strategy design: Firstly, they all assume that randomisation to the biomarker-led and the randomised arm is equal. Secondly, Hoering et al 8 and Young et al 10 both considered the special case of "no randomisation" in the randomised arm, which means that all patients in that arm receive the control treatment. Young et al also consider the case of equal randomisation within the randomised arm. While equal randomisation might be frequently used, it is not clear whether this minimises the sample size of the biomarker-strategy design. Furthermore, Young et al investigate the efficiency of the design either if there is no treatment effect in the disease-negative patients or if the treatment effect in the disease negative patients is in the same direction as the treatment effect for the positive patients but only half as large.
We hence extend the previous work in the following ways: (1) We investigate the efficiency of the design if the treatment effect in the disease-negative patients is in the opposite direction of the treatment effect for the biomarker-positive patients and (2) we investigate optimal randomisation rules for the design in order to minimise the sample size. Furthermore, we propose a different way to analyse the data obtained from the biomarker-strategy design. The "traditional" analysis does not exploit all information that can be obtained from the data. Rücker 11 proposed a two-stage randomised clinical trial design, which in principle corresponds to the biomarker strategy design with the only difference that patients in the "non-randomised arm" (ie, the biomarker-led arm) choose the treatment they prefer. However, she suggests an analysis method that allows testing of the overall treatment effect, the self-selection effect and the preference effect. Walter et al 12 derive optimal randomisation rules for the abovementioned design by Rücker. Their focus lies only on optimising the first randomisation while they assume equal allocation to treatment and control in the randomised arm. Turner et al 13 give a sample size equation for the design proposed by Rücker. However, like Walter et al, they also assume equal randomisation in the randomised arm. They also assume equal variances across all arms.
The aim of this paper is to adopt the analysis method proposed by Rücker and extend it to the biomarker strategy design by accounting for possible misclassifications using an imperfect assay. We provide equations to calculate the sample size as well as point estimators and confidence intervals for the treatment effects in the biomarker-positive and biomarker-negative subgroups.

MOTIVATING EXAMPLE
Brusselle et al 14 report the results from the AZISAST trial, a multicentre randomised double-blind placebo-controlled trial. Patients with exacerbation-prone severe asthma received low-dose azithromycin or placebo as add-on treatment to a combination therapy of inhaled corticosteroids and long-acting 2 agonists for 6 months. The primary outcome was the rate of severe exacerbations and the lower respiratory tract infections (LRTI) requiring treatment with antibiotics during the 26-week treatment phase, and one of the secondary endpoints was the forced expiratory volume in 1 second (FEV 1 ). For the primary endpoint as well as the abovementioned secondary endpoint, no effects were found for the overall population. However, opposite effects were found for the primary endpoint for patients with either eosinophilic or non-eosinophilic severe asthma. For the secondary outcome FEV 1 , they report a baseline value of 80.1 (standard deviation SD 21.9) for the azithromycin group and a baseline value of 84.8 (20.7) for the placebo group. They also report a mean difference (between baseline and 26 weeks) of −0.02 for the azithromycin group and a mean difference of −0.90 for the placebo group. They do not report the values for the two subgroups (eosinophilic or non-eosinophilic severe asthma), so for our illustrations, we therefore assume the following values: T + = 90, T − = 70, C + = 75, and C − = 95. Patients with eosinophilic severe asthma are considered "positive." Furthermore, we assume T + = T − = C + = C − = 20. For a prevalence of p = 0.5 for eosinophilic asthma, the effect in the treatment group is p T + + (1 − p) T − = 80; for the control group, the effect is p C + + (1 − p) C − = 85; and the overall treatment effect is which roughly reflects the effects observed in the AZISAST trial.

NOTATION
Let N denote the total sample size (assumed to be fixed), r 1 denote the fraction of patients randomised to the biomarker-led arm, and let n BM = Nr 1 denote the number of patients in the biomarker-led arm. Furthermore, let n R = N(1 − r 1 ) denote the number of patients in the randomised arm. Let r 2 denote the fraction of patients within the randomised arm that receive the experimental treatment and n T = N(1 − r 1 )r 2 denote the sample size for this arm. Furthermore, let n C = N(1 − r 1 )(1 − r 2 ) denote the sample size for the control arm of the randomised arm. We assume that a block randomisation is used so that the sample sizes n BM , n R , n T , and n C are fixed.
Within the biomarker-led arm, patients have their biomarker assessed. Due to cost, ethical, or administrative reasons, an imperfect assay is used to determine the true biomarker status. Let p denote the prevalence of the true biomarker status and let t and s denote the sensitivity and specificity of the biomarker assay used. Without loss of generality, we assume that patients with an observed positive biomarker status receive the experimental treatment while patients with an observed negative biomarker status receive the control treatment. Let n T ⊕ (n C ⊖ ) denote the number of patients with an observed positive (negative) biomarker status. The sample sizes follow binomial distributions with n T ⊕ ∼ Binomial(Nr 1 , p ⊕ ) and The primary endpoint of interest is denoted with Y and follows a normal distribution with mean T + and variance 2 T + for truly biomarker-positive patients receiving the experimental treatment, mean T − and variance 2 T − for truly biomarker-negative patients receiving the experimental treatment, mean C + and variance 2 C + for truly biomarker-positive patients receiving the control treatment, and mean C − and variance 2 C − for truly biomarker-negative patients receiving the control treatment. Table 1 gives an overview of some of the notation used.

TRADITIONAL ANALYSIS
The traditional analysis of the biomarker-strategy design is to compare the mean of the biomarker-led arm BM with the mean of the randomised arm R . 15 The null hypotheses states that H TA 0 : BM = R while the alternative hypothesis states that H TA 1 : BM ≠ R .
Let Z BM and Z R be defined as Z BM = . A test statistic for the hypothesis above is then given by The expected value and variance of the test statistic as well as an estimator for the variance can be found in Appendix A.

Criticism
Although the biomarker strategy design is sometimes regarded as the gold standard, there is also a lot of criticism regarding the trial design and the analysis of the data. The main problem with the traditional analysis is that it does not distinguish between a situation where there is only an overall treatment effect (but no treatment-by-biomarker interaction effect), a situation where there is only an interaction effect (but no treatment effect), and a situation where there is both a treatment and an interaction effect. This "inability" leads to the following problems: • Problem 1 "No interaction effect": As mentioned above, the traditional analysis cannot distinguish between the treatment and the interaction effect. If there is no interaction effect, we have T In this case, the expected value of the test statistic for the traditional analysis is E . So, as long as tp + (1 − p)(1 − s) − r 2 ≠ 0 and T• − C• ≠ 0, we observe a difference between the mean of the biomarker-led arm and the randomised arm even if there is no treatment-by-biomarker interaction effect. • Problem 2 "Biomarker assay not predictive": As Freidlin, McShane, and Korn 2 noted before for a design with r 2 = 0, it is still possible to observe a difference between the means even if the biomarker assay is not predictive, ie, , which is the same as (0.5 − r 2 ) times the "overall treatment effect." Hence, we would still observe a difference between the biomarker-led arm and the control arm (randomised arm) as long as there is an "overall treatment effect" and 0.5 − r 2 ≠ 0. If the observed difference is positive, we would recommend to use the assay to inform about the treatment a patient should receive although the assay is not predictive. • Problem 3 "Arbitrary randomisation": The expected value of the test statistic for the traditional analysis depends on the arbitrary randomisation ratio

Orthogonality condition
In the previous section, we have seen that the traditional analysis cannot distinguish between the treatment and the interaction effect. Under Problem 1, we show that the expected value in case there is no interaction effect is One way of solving this problem is to set r 2 = p ⊕ . This ensures that the traditional analysis is now a test for the interaction effect only and that if there is no interaction effect, the expected value of the test statistic is indeed 0 (the proof can be found in Appendix A). Hence, setting r 2 = p T ⊕ is a requirement in order to be able to interpret the results of the traditional analysis.
We call this the "orthogonality condition" for the traditional analysis as it is similar to the "orthogonality condition" in a two-way ANOVA, ie, sample sizes in the different cells have to be proportional (in our case, n T ⊕ ∕n T = n C ⊖ ∕n C ) in order to distinguish between the different effects. 16 This will also solve Problem 2 as setting Hence, we do not observe a difference between the means of the biomarker-led arm and the randomised arm if the biomarker assay is of no use at all. We also solve Problem 3, as r 2 is now fixed to p ⊕ .
In general, the expected difference between the mean of the biomarker-led arm and the randomised arm can now only be zero if (a) p = 0 or p = 1, (b) s + t = 1, or (c) T + − C + − ( T − − C − ) = 0. Case (a) refers to a situation where there are no subgroups to distinguish as every patients is either truly positive or truly negative, case (b) refers to a situation where the biomarker assay is of no use to distinguish between truly positive and truly negative patients, and case (c) refers to a situation where there is no treatment-by-biomarker interaction effect. While setting r 2 to p ⊕ fixes the problems regarding the traditional analysis, in practice, the value of p ⊕ is often unknown. Hence, a different method that is robust to the choice of r 2 is preferable. Furthermore, there might be other reasons (ethical or financial) why a different value of r 2 might be chosen.

ALTERNATIVE ANALYSIS METHOD
In order to overcome the abovementioned disadvantage of the traditional analysis method, we propose a different way to analyse the data obtained from a biomarker-strategy design by defining three test statistics that clearly distinguish between the treatment effect, the (so-called) biomarker effect, and the interaction effect.

Hypotheses
We define the following three effects: (1) the treatment effect, (2) the biomarker effect, and (3) the interaction effect with corresponding null hypotheses

Test statistics for the treatment, biomarker, and interaction effect 5.2.1 Test statistic for the treatment effect
The two means T and C can be estimated directly from the design by 1 respectively. Hence, we can define a test statistic for the treatment effect as follows: The expected value and variance of the test statistic as well as an estimator for the variance are given in Appendix B.

Test statistics for the biomarker and the interaction effect
In the following, we derive test statistics for the biomarker and the interaction effect. Note that the null hypotheses for the biomarker and the interaction effect depend on the means T + , C + , T − , and C − , which cannot be estimated directly from the design. However, it is possible to rewrite the null hypotheses in such a way that only parameters that can be estimated directly from the data are involved. Let T ⊕ denote the true mean of the biomarker-led treatment arm, and let C ⊖ denote the true mean of the biomarker-led control arm (see Table 1). It can be shown that in the case that there is no biomarker (interaction) effect, the following expressions are true: Now, let Z T and Z C be defined as follows: It can be shown that the left-hand side of Equations 6 and 7 can be estimated by (Z T − Z C ) ∕ (Nr 1 ) and (Z T + Z C ) ∕ (Nr 1 ), respectively. Hence, a test statistic for the biomarker effect is given by and for the interaction effect by The expected values and variances as well as estimators for the variances can be found in Appendix C.

Multiple testing
In cases where little is known about the new treatment under investigation and/or the biomarker used, it might be desirable to test more than one hypotheses within the same trial. For example, we might be interested in simultaneously testing the treatment and the interaction effect. Let COV T,B , COV T,I , and COV B,I denote the covariances between the test statistics for the treatment and the biomarker effect (T,B), the treatment and the interaction effect (T,I), and the biomarker and the interaction effect (B,I). The expected values of the covariances are given by Let T denote the vector containing the three test statistics for the treatment, biomarker, and interaction effect with , Σ = .
Functions like qmvnorm from the R-Package mvtnorm or the Mata function ghk() from Stata can be used in order to find the adjusted critical values to control the overall type I error rate. However, in the following, we focus on testing the interaction effect only, so no adjustment is made.

Sample size calculation
In the following, we give formulae to approximately calculate the sample sizes to test the different hypotheses given above. To simplify the notation, let T , C , T ⊕ , T ⊖ , A, B, and C be defined as follows: The following equations give the sample sizes required to detect a certain effect for the traditional analysis (N TR ), the treatment effect (N T ), the biomarker effect (N B ), and the interaction effect (N I ) based on a two two-sided test with type I error and power 1 − : It can be seen that all sample sizes depend on the randomisation ratios r 1 and r 2 . Optimal solutions can be found using a grid search.

Point estimators and confidence intervals
The test statistic for the interaction effect does not provide an answer to whether the interaction is quantitative (i.e., the treatment effect for biomarker-positive and biomarker-negative patients points in the same direction but has a different magnitude) or qualitative (i.e., the treatment effects for the two subgroups point in different directions). Hence, we derive point estimators and confidence intervals for the treatment effects T + − C + and T − − C − for the biomarker-positive and biomarker-negative patients, respectively. In the following, we assume that the true values of the prevalence p, the sensitivity t, and the specificity s are known (or can be estimated from a different trial). The treatment effects for biomarker-positive and biomarker-negative patients can be estimated bŷ Assuming that the estimators approximately follow a normal distribution, (1 − )% confidence intervals are given by [̂T withV AR [̂T The derivations can be found in the Supporting Information where we also provide the covariance between the two estimators allowing for the construction of joint confidence regions.

RESULTS
In order to be able to compare the traditional and the alternative analysis method, we set r 2 = p ⊕ . As shown in Section 4.2, this will ensure that the traditional analysis method provides a test statistic for the interaction effect only. On the basis of the results for the AZISAST trial (see Section 2), we calculated the sample sizes to test the interaction effect using the traditional and the alternative analysis method for different values of the prevalence p (from 0 to 1 in steps of 0.01) and the sensitivity t and specificity s (from 0.5 to 1 in steps of 0.1) for a two-sided type I error of = 0.05 and a power of 80%. In the following, we focus on testing the interaction effect only (instead of testing the treatment, the biomarker, and the interaction effect simultaneously), so no adjustment for the significance level is made. For the traditional analysis, r 2 is set to p ⊕ (as explained in Section 4.2). In order to find the optimal value for r 1 , we calculated the sample sizes for values of r 1 from 0 to 1 in steps of 0.01. The randomisation ratio r 1 is then chosen so that the resulting sample size is minimised. Obviously, choosing smaller increments will yield a more accurate value for r 1 . For the alternative analysis method, we used a grid search over r 1 and r 2 (from 0 to 1 in steps of 0.01). For each combination of r 1 and r 2 , we calculated the resulting sample sizes and chose the combination that minimises the sample size for given values of p, t, and s. In order to verify the obtained sample sizes, we simulated 10 000 data sets and estimated the resulting type I error and power for different scenarios (see Appendix D).

Results for the traditional analysis to test the interaction effect
The left-hand side of Figure 2 shows the minimal sample sizes needed for the traditional analysis depending on the prevalence p, the sensitivity t, and the specificity s for the AZISAST trial. The black solid line shows the sample sizes for a perfect biomarker assay, the grey solid line for a biomarker assay with t = s = 0.9, and the black dashed line for a biomarker assay with t = s = 0.8. The labels show the values for the randomisation ratios r 1 and r 2 with the value of r 1 being the optimal value for r 2 = p ⊕ .
As expected, we see that the sample size for the perfect assay is smaller than the sample sizes for an imperfect assay with the sample size getting larger the smaller the values for t and s. We also see that the sample size increases with smaller and

FIGURE 2
Minimal sample sizes to test the interaction effect using the traditional and alternative analysis method depending on the prevalence p, the sensitivity t, the specificity s, and the randomisation ratios r 1 and r 2 larger values for the prevalence p. The sample size for the traditional analysis is smallest for a perfect assay, a prevalence of p = 0.5, and randomisation ratios of r 1 = 0.47 and r 2 = 0.5. In this case, only 142 patients would be needed. If the prevalence is 0.15, the sample size increases to 516 for the perfect biomarker assay.
For an imperfect assay with t = s = 0.9, the sample size is 231 for a prevalence of 0.5, and hence, in comparison to a perfect assay, we already need nearly 90 patients more. For a prevalence of 0.15, the sample size increases to 853, and therefore, nearly 340 more patients are needed in comparison with the perfect assay. The sample sizes are even larger when the sensitivity and specificity are 0.8. In this case, 423 patients are needed if p = 0.5 and 1580 if p = 0.15. Compared with the case of a perfect assay, the sample sizes approximately triple.
In general, it should be noted that the optimal value for r 1 is not necessarily 0.5 but varies between 0.47 and 0.51. However, for a prevalence between 0.05 and 0.95, the difference in sample sizes for the optimal value of r 1 and r 1 = 0.5 is less than one patient. Therefore, in most cases, r 1 can be set to 0.5 without a noticeable change in the sample size.

Results for the alternative method to test the interaction effect
In the following, we show the results for the sample sizes for testing the interaction effect based on our proposed analysis method. The right-hand side of Figure 2 shows the resulting sample sizes for the interaction effect. As before, the black solid line shows the sample sizes for the perfect biomarker assay, the grey solid line for an assay with t = s = 0.9, and the black dashed line for an assay with t = s = 0.8. Labels show the optimal values for r 1 and r 2 so that the sample size is minimised for given values of p, t, and s.
Again, we see that the sample sizes are smaller for prevalences around 0.5 and increase towards smaller and larger values. For example, for a perfect assay and a prevalence of 0.5, the minimal sample size for the interaction effect is 142. It increases to 250 if the prevalence is 0.25 and to 530 if the prevalence is only 0.15. We also see that (as expected) larger sample sizes are needed if the assay is not perfect. For example, for an imperfect assay with t = s = 0.9, the minimal sample sizes are 231 (for p = 0.5), 400 (for p = 0.25), and 836 (for p = 0.15). For an imperfect assay with t = s = 0.8, the following sample sizes result 422 (for p = 0.5), 722 (for p = 0.25), and 1498 (for p = 0.15).
As for the traditional analysis, we note that the optimal value for the randomisation ratio r 2 is r 2 = p ⊕ . However, while this is a requirement for the traditional analysis in order for the results to be interpretable (see Section 4.2), for the interaction effect setting r 2 to p ⊕ minimises the sample size but is not necessary in order to interpret the results of the test statistic.
We also see that the optimal value for r 1 is not necessarily 0.5 but ranges between 0.47 and 0.5. Again, the differences between the minimal sample size and the one obtained for r 1 = 0.5 is less than one patient as long as the second randomisation ratio r 2 is chosen so that the sample size is minimised given r 1 . The situation changes noticeably if both randomisation rules are set to 0.5, which is frequently done. Let N 0.5,0.5 denote the sample size for the interaction effect for r 1 = r 2 = 0.5 and let N min denote the minimal sample size. Figure 3 shows the absolute difference in the sample sizes (N 0.5,0.5 − N min , left-hand side) and the relative differences (N 0.5,0.5 ∕N min , right-hand side). As we can see, the absolute differences vary between 0 and approximately 130 (for a prevalence between 0.15 and 0.85). The absolute difference is even larger for smaller (and higher) values of the prevalence. While we see that the absolute difference of the sample sizes does not depend on the sensitivity and specificity of the biomarker, we see that the relative difference does (see right-hand side of Figure 3). The highest relative differences occur for the perfect biomarker assay (as this has the smallest minimal sample size). Figure 4 shows the absolute differences in sample size for the traditional analysis and the alternative analysis for the interaction effect (left-hand side) as well as the relative differences in sample size in percent (right-hand side) . As we can see, the sample size for the traditional analysis is often higher than the sample size for the interaction effect (except for a perfect biomarker assay where the sample size for the traditional design is always lower than the sample size for the interaction effect). The absolute difference in the sample size might be substantial: for example, for t = s = 0.7 and p = 0.15, the absolute difference in sample size is 268. The relative increase in the sample size is about 8% (from N I = 3388 to N TR = 3656). We can also see that in cases where the sample size for the traditional analysis is smaller than the sample size for the interaction effect, the increase in sample size is less than 3%. For example, for t = s = 1 and p = 0.15, the sample size increases from N TR = 516 to N I = 530.

FIGURE 3
Difference in sample sizes to test the interaction effect using the alternative analysis method with r 1 = r 2 = 0.5 or the optimal randomisation ratios depending on the prevalence p, the sensitivity t, the specificity s, and the randomisation ratios r 1 and r 2 FIGURE 4 Differences in sample sizes to test the interaction effect using the traditional and alternative analysis method depending on the prevalence p, the sensitivity t, and the specificity s

CONCLUSION
This paper investigates the performance of the traditional analysis of the biomarker-strategy design. We derive optimal randomisation rules in order to minimise the sample size for the traditional analysis. We also propose a different analysis method for the data obtained from such a design and explore the properties of this method.
The traditional analysis has been criticised many times. 2,9,17,18 The main problem is that the traditional analysis does not distinguish between the treatment and the interaction effect. Hence, it is possible to observe no difference between the mean of the biomarker-led arm and the randomised arm even if there is an interaction effect or to observe a difference between the means even if there is no interaction effect. We show that if the sample sizes in the four subgroups of the biomarker-strategy design fulfil the "orthogonality condition" (ie, are proportional to each other), the abovementioned problems can be solved. However, while the "orthogonality condition" ensures that the traditional analysis provides a test statistic for the interaction effect only, the value of p ⊕ is often not known as it depends on the true values for the sensitivity, the specificity, and the prevalence. These values are often unknown at the planning stage of a trial.
To overcome the disadvantages of the traditional analysis, we propose a different analysis method based on the work of Rücker. 11 We define three different effects than can be measured within the biomarker-strategy design: the treatment effect, the biomarker effect, and the interaction effect. This ensures that our analysis can clearly distinguish between the treatment effect and the interaction effect, regardless of the randomisation rules used. We derive test statistics and show how the sample sizes for the three effects can be calculated. Optimal randomisation rules are derived in order to minimise the sample sizes. We also provide point estimators and confidence intervals for the treatment effects in the two subgroups.
Furthermore, we show that if the "orthogonality condition" is fulfilled, the sample size for the interaction effect (based on the alternative analysis method) is often smaller than the sample size for the traditional analysis. In cases where the sample size for the alternative analysis method is larger than the sample size for the traditional analysis, the difference is less than 3% of the sample size and therefore often negligible. In general, we therefore recommend to use the alternative analysis method instead of the traditional analysis method.
It should be noted that the biomarker-strategy design is less efficient than the biomarker-stratified design. Shih and Lin 19 recently published an evaluation of the relative efficiency comparing the variances of the two designs with each other. They conclude that the biomarker-strategy design is less efficient than the biomarker-stratified design, which comes as no surprise given that the biomarker-stratified design contains more information. While the qualitative conclusion of Shih and Lin is correct (ie, the biomarker-stratified design is more efficient than the biomarker-strategy design), their quantitative results are based on setting r 2 equal to 0.5. As we have shown, this choice will not lead to the most efficient design (see Section 6.2). Hence, it remains to be seen how much more efficient the biomarker-stratified design is as compared to the biomarker-strategy design if optimal randomisation rules are used.

APPENDIX A: TRADITIONAL ANALYSIS
The expected value and variance of the test statistic for the traditional analysis are given by The variance can be estimated bŷ In the following, we show that by setting r 2 = p ⊕ , the traditional analysis method provides a test statistic for the interaction effect. According to Equation A1, the expected value is given by ) . (A4)

APPENDIX B: TREATMENT EFFECT
The expected value and true variance of the test statistic for the treatment effect are given by The variance can be estimated bŷ .

APPENDIX C: BIOMARKER AND INTERACTION EFFECT
The expected values are given by The true variances are given by and the true covariance by The variances and covariances of Z T and Z C can be estimated as follows: The derivations can be found in the Supporting Information. We anticipate the test statistics to approximately follow a normal distribution.

APPENDIX D: SIMULATION RESULTS FOR THE TYPE I ERROR AND POWER
In order to verify the sample sizes obtained in Section 6, we simulated 10 000 data sets and estimated the resulting type I error and power (Table D1). The power was estimated using the same values as described in Sections 2 and 6. For the type I error, the means were set to T + = 90, T − = 70, C − = 75, and C − = 55 (ie, assuming that there is no interaction effect).