Improving sample size recalculation in adaptive clinical trials by resampling

Sample size calculations in clinical trials need to be based on sound parameter assumptions. Wrong parameter choices may lead to sample sizes that are too small or too large and can have severe ethical and economic consequences. Adaptive group sequential study designs are one solution to deal with planning uncertainties. Here, the sample size can be updated during an ongoing trial based on the observed interim effect. However, the observed interim effect is a random variable and thus does not necessarily correspond to the true effect. One way of dealing with the uncertainty related to this random variable is to include resampling elements in the recalculation strategy. In this paper, we focus on clinical trials with a normally distributed endpoint. We consider resampling of the observed interim test statistic and apply this principle to several established sample size recalculation approaches. The resulting recalculation rules are smoother than the original ones and thus the variability in sample size is lower. In particular, we found that some resampling approaches mimic a group sequential design. In general, incorporating resampling of the interim test statistic in existing sample size recalculation rules results in a substantial performance improvement with respect to a recently published conditional performance score.

with a sample size adaptation. There exist different possibilities on how to adapt the sample size based on the interim results. Various advantages of adaptive designs are described in the recently published draft guideline by the Food and Drug Administration on adaptive designs, 1 yet no explicit advice is given on the choice of sample size recalculation rules. Likewise, the European Medicines Agency gives no clear advice on how to select a certain sample size recalculation rule in their reflection paper on adaptive designs. 2 The most common approach for adapting the sample size is based on conditional power arguments. 3 In this paper, the conditional power describes the probability of correctly rejecting the null hypothesis at the final analysis, given the observed interim data and the assumed true effect. Sample size recalculation rules based on conditional power arguments go back to Proschan and Hunsberger 4 as well as Lehmacher and Wassmer 5 in the 1990s. Nowadays, there exists a great variety of extended approaches based on these initial methods. [6][7][8] Another strategy is to consider the sample size update as the solution of an optimization problem in terms of a selected set of design features (e.g., conditional power, total sample size). 9,10 Although a large number of recalculation strategies thus exists, there are still major problems related to sample size recalculation. Those problems are, for example, not meeting the target power, large recalculated sample sizes, and a high variability in the recalculated sample size. 11,12 Particularly, Bauer et al. 3 noted that the assumption that the observed interim effect is the true effect "may lead to a highly variable distribution of the conditional power resulting in highly variable sample sizes." The uncertainty of the interim result is indeed one of the major problems with sample size recalculation. One way to tackle this problem is to use a Bayesian approach.
Here, a certain probability distribution for the true effect is assumed. More precisely, the sample size calculation can be based on the Bayesian predictive power 13 instead of the conditional power. The underlying idea is to determine an average conditional power based on a prior distribution assumption for the true effect. The foundation for this approach was laid by Spiegelhalter et al. 14,15 Another option to take account of the whole distribution of the interim test statistic is to implement resampling or bootstrapping approaches. For many applications it is known that the interim test statistic is approximately normally distributed where the expected value is just the test statistic for the true standardized effect and the variance equals 1. Naturally, the observed interim test statistic corresponds to the expectation of its bootstrapping distribution. By inserting the transformed interim test statistic as the expected value of the distribution, this allows us to draw random samples from this distribution directly instead of bootstrapping from the complete data set. This clearly reduces the computational effort. For this reason, we will use the term "resampling" in the following, although the procedure mimics a bootstrapping approach.
Preliminary work on the combination of bootstrapping with sample size recalculation was conducted by Hade et al. 16 For a blinded survival data setting, they proposed to use additional information from prior trials just before the ending of the planned accrual time. By repeatedly estimating the baseline survival function, the sample size was re-estimated. They determined the 80%-quantile of the bootstrapped distribution as the final sample size.
In this article, we propose to combine a resampling approach with a variety of existing sample size recalculation rules. The resampling approach can be applied to different types of endpoints with approximately normally distributed test statistics, however, we focus on a normally distributed endpoint for the sake of simplicity. By this approach, we treat the interim test statistic as a random variable. The idea is to repeatedly draw test statistics from the interim test statistic distribution, which is approximated by plugging in the observed interim test statistic as expectation. For each randomly drawn test statistic, we recalculate the sample size according to an established sample size recalculation rule. All these recalculated sample sizes are then combined into a single updated sample size for the second stage. We evaluate the resampling approach by means of performance characteristics from the conditional perspective. This refers to evaluating the sample size and power under the assumption of already knowing that the study is not stopped at interim, thus actually having the possibility to recalculate the sample size.
The article is structured as follows. In the following methods section, we describe the test problem, common recalculation strategies, and the new resampling approach that accounts for the uncertainty of the observed interim effect. Moreover, we give a short overview of the performance criteria used to qualify "good" recalculation strategies. The third section displays the simulation to evaluate the performance of the existing and new approach. The results are compared and discussed in detail in the last section. General recommendations for further applications are deduced.

| The test problem
We consider a randomized, controlled, two-armed clinical trial. The $n$ observations per group of the intervention group $I$ and the control group $C$ follow normal distributions with means $\mu_I$ and $\mu_C$ and common variance $\sigma^2$, that is,

$$X_j^I \sim N(\mu_I, \sigma^2), \quad X_j^C \sim N(\mu_C, \sigma^2), \quad j = 1, \dots, n.$$

Throughout this work, we investigate the one-sided superiority test problem

$$H_0: \mu_I - \mu_C \le 0 \quad \text{versus} \quad H_1: \mu_I - \mu_C > 0,$$

hence referring to a setting where large values of the endpoint are considered as favorable. We consider an adaptive group sequential design with two stages, which is the simplest and most frequently applied adaptive group sequential design. 3 Hence, we have to define two independent test statistics,

$$T_i = \frac{\bar{X}_i^I - \bar{X}_i^C}{S_{pooled,i}} \cdot \sqrt{\frac{n_i}{2}}, \quad i \in \{1, 2\},$$

where $i$ refers to the stages, $\bar{X}_i^I$ and $\bar{X}_i^C$ to the respective means, $S_{pooled,i}$ to the pooled standard deviation, and $n_i$ to the sample size per group in stage $i$ with $n_1 + n_2 = n$. Note that $T_1$ exclusively includes the data of the first stage and $T_2$ only the data of the second stage, both following approximately a normal distribution.
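As an illustration of the stage-wise test statistic, the following Python sketch computes $T_i$ from the raw observations of one stage. It assumes equal group sizes within the stage, so that the pooled variance is the average of the two sample variances; the function name `stage_statistic` and the toy data are hypothetical and not taken from the original analysis.

```python
from math import sqrt
from statistics import mean, variance


def stage_statistic(x_intervention, x_control):
    """Stage-wise statistic T_i for two equally sized groups (illustrative sketch)."""
    n_i = len(x_intervention)
    if len(x_control) != n_i or n_i < 2:
        raise ValueError("need two groups of equal size with at least 2 observations")
    # With equal group sizes, the pooled variance is the mean of the two
    # sample variances (assumption of this sketch).
    s_pooled = sqrt((variance(x_intervention) + variance(x_control)) / 2)
    return (mean(x_intervention) - mean(x_control)) / s_pooled * sqrt(n_i / 2)


# Toy data (hypothetical): mean difference 1, pooled standard deviation 2.
t_example = stage_statistic([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])
```

Large positive values of the statistic favor the intervention, in line with the one-sided test problem.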
The trial is continued with the second stage if the interim test statistic $T_1$ falls within the so-called recalculation area (RA) given by the interval $[q_{1-\alpha_0}; q_{1-\alpha_1})$, where $\alpha_0$ refers to a futility stopping bound for the one-sided p-value of stage one, $\alpha_1$ to the local one-sided significance level, and $q$ indicates the respective quantiles of the standard normal distribution. Hence, the trial is stopped after the first stage with an early rejection of the null hypothesis if $T_1 \ge q_{1-\alpha_1}$, or with acceptance of the null hypothesis if $T_1 < q_{1-\alpha_0}$.
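The interim decision can be sketched in a few lines of Python. The default values $\alpha_0 = 0.5$ and $\alpha_1 = 0.0147$ are the ones used later in the simulation setup; the function name `interim_decision` and the returned labels are hypothetical.

```python
from statistics import NormalDist


def interim_decision(t1, alpha0=0.5, alpha1=0.0147):
    """Classify the interim statistic t1 (illustrative sketch)."""
    q_futility = NormalDist().inv_cdf(1 - alpha0)   # lower bound of the RA
    q_efficacy = NormalDist().inv_cdf(1 - alpha1)   # upper bound of the RA
    if t1 >= q_efficacy:
        return "stop for efficacy"
    if t1 < q_futility:
        return "stop for futility"
    return "continue with second stage"


decision = interim_decision(1.0)
```

With these defaults the recalculation area starts at 0, because the 50% quantile of the standard normal distribution is 0.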
All observed data over the two stages are combined by means of the inverse normal combination test 11 represented by the combined test statistic

$$T_{1+2} = \frac{w_1 T_1 + w_2 T_2}{\sqrt{w_1^2 + w_2^2}},$$

consisting of the two stochastically independent test statistics $T_1$ and $T_2$, where we choose the weights $w_1 = \sqrt{n_1}$ and $w_2 = \sqrt{n_2}$. The null hypothesis is rejected at the final analysis if $T_{1+2} \ge q_{1-\alpha_{1+2}}$, where $\alpha_{1+2}$ corresponds to the local one-sided significance level for the final analysis. Local significance levels can, for example, be chosen according to the adjustments proposed by Pocock 17 or by O'Brien and Fleming. 18
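A minimal sketch of the inverse normal combination test is given below. It assumes the weighted-sum form normalized by the root of the summed squared weights; the function names and the default level 0.0147 (the Pocock-adjusted value used in the simulation setup) are choices made for this illustration.

```python
from math import sqrt
from statistics import NormalDist


def inverse_normal_combination(t1, t2, w1, w2):
    """Combined statistic T_{1+2} from the two stage-wise statistics."""
    return (w1 * t1 + w2 * t2) / sqrt(w1**2 + w2**2)


def reject_at_final(t1, t2, w1, w2, alpha_final=0.0147):
    """True if the null hypothesis is rejected at the final analysis."""
    critical_value = NormalDist().inv_cdf(1 - alpha_final)
    return inverse_normal_combination(t1, t2, w1, w2) >= critical_value


w = sqrt(50)  # equal weights for n1 = n2 = 50
combined = inverse_normal_combination(1.0, 2.0, w, w)
```

Because the weights are fixed before the trial, the combined statistic remains standard normal under the null hypothesis even if the second-stage sample size is changed at interim.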

| Sample size recalculation approaches
There exist different ways for adapting the sample size during an ongoing trial. One of the simplest methods is the group sequential study design (GS), where the sample size for every stage equals a fixed pre-defined number. Group sequential designs may be considered as a special case of adaptive group sequential designs where the stage-wise sample sizes can be chosen in a data-dependent way. 5,19 Comparisons of the group sequential and the adaptive group sequential study designs can be found in References 12,[20][21][22]. The most popular strategy in adaptive group sequential study designs is to update the sample size such that a certain pre-specified conditional power value is reached. 3 The conditional power describes the probability of correctly rejecting the null hypothesis, given the observed value of the test statistic at interim $T_1 = t_1$ and total sample size $n$ per group. Moreover, it depends on the true standardized treatment effect $\Delta = (\mu_I - \mu_C)/\sigma$. The corresponding formula looks as follows:

$$CP_\Delta(t_1, n) = \begin{cases} 0, & \text{if the trial is stopped early for futility,} \\ 1 - \Phi\left( \dfrac{q_{1-\alpha_{1+2}} \sqrt{w_1^2 + w_2^2} - w_1 t_1}{w_2} - \Delta \sqrt{\dfrac{n - n_1}{2}} \right), & \text{if the sample size is recalculated,} \\ 1, & \text{if the trial is stopped early for efficacy,} \end{cases}$$

where $\Phi$ denotes the cumulative distribution function of the standard normal distribution. In the following, we describe three ways of recalculating the sample size based on the observed conditional power, which means that in the above formula $\Delta$ is replaced by the observed interim effect $\hat{\Delta} = t_1 \cdot \sqrt{2/n_1}$. There are similar approaches which insert the assumed effect for $\Delta$, also referred to as anticipated conditional power. The focus of this work is on the effects of the proposed resampling tool, which can be combined with every existing recalculation rule, so it is not essential here to cover a wide range of recalculation rules that differ only in the value employed for $\Delta$. Therefore, we only consider recalculation rules based on the observed conditional power here. For all investigated approaches, we limit the maximal total sample size per group to $n_{max}$ for feasibility reasons.
A more detailed description of the following three recalculation rules is given in Herrmann et al. 23
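The conditional power inside the recalculation area can be sketched as follows. The formula used here is derived from the inverse normal combination test under the assumption that the second-stage statistic is normally distributed with mean $\Delta \sqrt{(n - n_1)/2}$ and variance 1; the function names and the default design values ($n_1 = 50$, $w_1 = w_2 = \sqrt{50}$, $\alpha_{1+2} = 0.0147$) are taken from the simulation setup or chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist

PHI = NormalDist()


def conditional_power(t1, n, delta, n1=50, w1=sqrt(50), w2=sqrt(50), alpha_final=0.0147):
    """Conditional power for t1 inside the RA and total sample size n per group."""
    if n < n1:
        raise ValueError("total sample size must not be smaller than n1")
    q_final = PHI.inv_cdf(1 - alpha_final)
    # Value the second-stage statistic T2 has to reach for a final rejection.
    threshold = (q_final * sqrt(w1**2 + w2**2) - w1 * t1) / w2
    # Under effect delta, T2 is approximately N(delta * sqrt((n - n1) / 2), 1).
    return 1 - PHI.cdf(threshold - delta * sqrt((n - n1) / 2))


def observed_conditional_power(t1, n, n1=50, **design):
    """Conditional power with delta replaced by the observed interim effect."""
    return conditional_power(t1, n, delta=t1 * sqrt(2 / n1), n1=n1, **design)


cp_at_nmax = observed_conditional_power(1.0, 200)
cp_at_nini = observed_conditional_power(1.0, 100)
```

The two example calls evaluate the observed conditional power for $t_1 = 1$ at total sample sizes of 200 and 100 per group, which are the maximal and the initially planned sample sizes of the simulation setup.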

| The observed conditional power approach
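For observed interim test statistics $t_1$ falling in the recalculation area $[q_{1-\alpha_0}; q_{1-\alpha_1})$, the sample size per group that ensures an observed conditional power of $1 - \beta$ is determined by the smallest integer $\tilde{n}$ that fulfills the inequality

$$CP_{\hat{\Delta}}(t_1, \tilde{n}) \ge 1 - \beta, \quad \hat{\Delta} = t_1 \cdot \sqrt{2/n_1}. \tag{6}$$

The total sample size per group according to the observed conditional power (OCP) approach is given by

$$n_{OCP}(t_1) = \begin{cases} \min(\tilde{n}, n_{max}), & \text{if } t_1 \in RA, \\ n_1, & \text{else.} \end{cases}$$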
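A hedged sketch of the OCP rule: the helper `n_tilde` solves inequality (6) in closed form, which is possible because the observed conditional power is monotone in the sample size for a positive interim effect. The design constants are those of the simulation setup; all function names are hypothetical, and returning $n_{max}$ when no finite solution exists (interim effect of zero) is an assumption of this sketch.

```python
from math import ceil, inf, sqrt
from statistics import NormalDist

PHI = NormalDist()
N1, N_MAX, POWER = 50, 200, 0.8            # design values from the simulation setup
W1 = W2 = sqrt(50)
Q_FUT = PHI.inv_cdf(1 - 0.5)               # futility bound, alpha_0 = 0.5
Q_EFF = Q_FINAL = PHI.inv_cdf(1 - 0.0147)  # Pocock-adjusted local levels


def threshold(t1):
    """Value T2 has to reach for rejection at the final analysis."""
    return (Q_FINAL * sqrt(W1**2 + W2**2) - W1 * t1) / W2


def observed_cp(t1, n):
    """Observed conditional power, delta replaced by t1 * sqrt(2 / n1)."""
    return 1 - PHI.cdf(threshold(t1) - t1 * sqrt(2 / N1) * sqrt((n - N1) / 2))


def n_tilde(t1):
    """Smallest total n per group with observed CP >= POWER (inf if none exists)."""
    need = threshold(t1) + PHI.inv_cdf(POWER)
    if need <= 0:
        return N1
    if t1 <= 0:
        return inf
    return N1 + ceil(2 * (need / (t1 * sqrt(2 / N1))) ** 2)


def n_ocp(t1):
    """Total sample size per group under the OCP rule."""
    if not (Q_FUT <= t1 < Q_EFF):
        return N1                          # trial stops at interim
    return min(n_tilde(t1), N_MAX)


n_example = n_ocp(2.0)
```

For small interim statistics inside the recalculation area the rule jumps to the cap $n_{max}$, which is the behavior that the resampling approach is meant to smooth.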

| The restricted observed conditional power approach
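The restricted observed conditional power (ROCP) approach is very similar to the observed conditional power approach but, as the name suggests, it contains a restriction. One point of criticism regarding the observed conditional power approach is that when formula (6) demands a sample size higher than the maximal sample size $n_{max}$, the sample size is fixed to $n_{max}$ irrespective of the conditional power that can be obtained with this highest sample size. Hence, it could be reasonable to increase the sample size only if a certain minimal conditional power $1 - \beta^{ROCP}_{low}$ can be attained. Consequently, the total sample size per group for the ROCP approach equals

$$n_{ROCP}(t_1) = \begin{cases} \tilde{n}, & \text{if } t_1 \in RA \text{ and } \tilde{n} \le n_{max}, \\ n_{max}, & \text{if } t_1 \in RA,\ \tilde{n} > n_{max} \text{ and } CP_{\hat{\Delta}}(t_1, n_{max}) \ge 1 - \beta^{ROCP}_{low}, \\ n_1, & \text{else,} \end{cases}$$

where $CP_{\hat{\Delta}}(t_1, n_{max})$ denotes the observed conditional power at the maximal sample size.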
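The restriction can be sketched by extending the OCP logic with one extra check. As before, the constants come from the simulation setup ($1 - \beta^{ROCP}_{low} = 0.6$), the closed-form solution of inequality (6) is an implementation choice of this sketch, and all names are hypothetical.

```python
from math import ceil, inf, sqrt
from statistics import NormalDist

PHI = NormalDist()
N1, N_MAX, POWER, POWER_LOW_ROCP = 50, 200, 0.8, 0.6
W1 = W2 = sqrt(50)
Q_FUT = PHI.inv_cdf(1 - 0.5)
Q_EFF = Q_FINAL = PHI.inv_cdf(1 - 0.0147)


def threshold(t1):
    return (Q_FINAL * sqrt(W1**2 + W2**2) - W1 * t1) / W2


def observed_cp(t1, n):
    return 1 - PHI.cdf(threshold(t1) - t1 * sqrt(2 / N1) * sqrt((n - N1) / 2))


def n_tilde(t1):
    """Smallest total n per group with observed CP >= POWER (inf if none exists)."""
    need = threshold(t1) + PHI.inv_cdf(POWER)
    if need <= 0:
        return N1
    if t1 <= 0:
        return inf
    return N1 + ceil(2 * (need / (t1 * sqrt(2 / N1))) ** 2)


def n_rocp(t1):
    """Total sample size per group under the ROCP rule."""
    if not (Q_FUT <= t1 < Q_EFF):
        return N1
    n = n_tilde(t1)
    if n <= N_MAX:
        return n
    # Only go to the cap if it still yields the minimal conditional power.
    if observed_cp(t1, N_MAX) >= POWER_LOW_ROCP:
        return N_MAX
    return N1


n_small_effect = n_rocp(1.0)
```

For $t_1 = 1$ the cap of 200 yields an observed conditional power well below 0.6, so the rule returns $n_1$, meaning that no second stage is conducted.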

| The promising zone approach
The promising zone (PZ) approach was proposed by Mehta and Pocock. 8 They specify an initial total sample size $n_{ini}$ per group that is smaller than the maximally allowed total sample size $n_{max}$ per group. Moreover, they define a lower bound for the conditional power $1 - \beta^{PZ}_{low}$. Note that $1 - \beta^{PZ}_{low}$ does not necessarily need to equal $1 - \beta^{ROCP}_{low}$. Depending on the observed interim test statistic $t_1$, the updated sample size is determined either by the initially proposed total sample size $n_{ini}$, as $\tilde{n}$ according to (6), or as the limiting maximal sample size $n_{max}$ per group. Explicitly, the total sample size per group according to the PZ approach is given by

$$n_{PZ}(t_1) = \begin{cases} n_{ini}, & \text{if } t_1 \in RA \text{ and } CP_{\hat{\Delta}}(t_1, n_{ini}) < 1 - \beta^{PZ}_{low}, \text{ or } t_1 \in RA \text{ and } CP_{\hat{\Delta}}(t_1, n_{ini}) \ge 1 - \beta, \\ \min(\tilde{n}, n_{max}), & \text{if } t_1 \in RA \text{ and } 1 - \beta^{PZ}_{low} \le CP_{\hat{\Delta}}(t_1, n_{ini}) < 1 - \beta, \\ n_1, & \text{else.} \end{cases}$$
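The three zones of the PZ rule can be sketched as follows. The constants ($n_{ini} = 100$, $1 - \beta^{PZ}_{low} = 0.36$) are those of the simulation setup; the function names and the closed-form solution of inequality (6) are assumptions of this illustration.

```python
from math import ceil, inf, sqrt
from statistics import NormalDist

PHI = NormalDist()
N1, N_INI, N_MAX, POWER, POWER_LOW_PZ = 50, 100, 200, 0.8, 0.36
W1 = W2 = sqrt(50)
Q_FUT = PHI.inv_cdf(1 - 0.5)
Q_EFF = Q_FINAL = PHI.inv_cdf(1 - 0.0147)


def threshold(t1):
    return (Q_FINAL * sqrt(W1**2 + W2**2) - W1 * t1) / W2


def observed_cp(t1, n):
    return 1 - PHI.cdf(threshold(t1) - t1 * sqrt(2 / N1) * sqrt((n - N1) / 2))


def n_tilde(t1):
    """Smallest total n per group with observed CP >= POWER (inf if none exists)."""
    need = threshold(t1) + PHI.inv_cdf(POWER)
    if need <= 0:
        return N1
    if t1 <= 0:
        return inf
    return N1 + ceil(2 * (need / (t1 * sqrt(2 / N1))) ** 2)


def n_pz(t1):
    """Total sample size per group under the promising zone rule."""
    if not (Q_FUT <= t1 < Q_EFF):
        return N1
    cp_ini = observed_cp(t1, N_INI)
    if cp_ini < POWER_LOW_PZ or cp_ini >= POWER:
        return N_INI                      # unfavorable or favorable zone
    return min(n_tilde(t1), N_MAX)        # promising zone


n_unfavorable = n_pz(1.0)
```

Only interim statistics in the promising zone lead to an increase beyond $n_{ini}$, so the sample size function is constant over most of the recalculation area.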

| Performance evaluation for sample size recalculation rules
When aiming at improving sample size recalculation rules, appropriate performance evaluation criteria need to be prespecified. Typical evaluation criteria are the average sample size as well as global power. Both can be considered as random variables in the adaptive design setting. Herrmann et al. 23 pointed out that it is important to not only consider these location measures but to additionally provide variation measures. Generally, there are different perspectives for evaluating the performance of a sample size recalculation rule. The global perspective is the one before the trial is started and thus takes an "average" look on the two options to stop the trial early or to recalculate the sample size at interim. However, the resulting mixture of performance features related to an early stop and performance features related to a sample size recalculation may be difficult to interpret. Another option is to consider the conditional perspective. Here, the researcher asks before the trial is started how the sample size should be recalculated if at interim the observed effect falls within the recalculation area. This means that we evaluate the recalculation rules under the assumption that we already know that the trial is not stopped at interim ($t_1 \in [q_{1-\alpha_0}; q_{1-\alpha_1})$) but we do not know the observed $t_1$-value yet. The word "conditional" thus refers to the recalculation area and not to a particular value of $t_1$. In this paper, we focus on this conditional perspective. Therefore, we investigate sample size recalculation possibilities with respect to location and variation measures of the conditional power and of the conditional sample size. All performance measures are simulated under a range of true standardized effect sizes $\Delta = (\mu_I - \mu_C)/\sigma$. Moreover, all these evaluation criteria can be combined within a single performance value, the conditional performance score CS by Herrmann et al. 23 Here, we only describe the key features of this score.
The conditional performance score consists of four components: a location and a variation component for the conditional power ($e_{CP}(\Delta)$ and $v_{CP}(\Delta)$), as well as a location and a variation component for the conditional sample size ($e_{CN}(\Delta)$ and $v_{CN}(\Delta)$). The underlying idea for the location components is to evaluate the expected values in relation to pre-defined target values. If the related fixed sample size is not greater than the maximally allowed sample size and the effect size does not equal zero, then the initially planned power value $1 - \beta$ is taken as target value for the conditional power and the fixed sample size as target value for the conditional sample size. In the other cases, the first stage's sample size $n_1$ and the global one-sided significance level $\alpha$ are defined as target values since the trial may then be declared as not worth the effort to continue with the second stage. 23 Concerning the variation components, the observed variation is compared to the maximally possible variation in the respective setting. All four score components can take values between 0 and 1. Hence, they can all be evaluated separately and it is possible to combine them in two sub-scores, a conditional power sub-score $SCP(\Delta)$ and a conditional sample size sub-score $SCN(\Delta)$, or to a single performance value, the conditional performance score CS given by

$$CS(\Delta) = \frac{1}{2} \left( SCP(\Delta) + SCN(\Delta) \right). \tag{10}$$

For all (sub-)scores and components it holds that larger values correspond to a better performance. The components within the two sub-scores can be weighted in different ways by considering, for example for the conditional power sub-score,

$$SCP(\Delta) = \gamma_{loc} \cdot e_{CP}(\Delta) + \gamma_{var} \cdot v_{CP}(\Delta),$$

where $\gamma_{loc}$ and $\gamma_{var}$ describe the respective weights for the location component $e_{CP}$ as well as the variation component $v_{CP}$.
The same construction applies to the conditional sample size sub-score. In this paper, we consider an equal weighting of all components, that is, γ loc = γ var = 0.5. The detailed formulas for the conditional performance components and (sub-) scores can be found in the Appendix and were initially published by Herrmann et al. 23
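The aggregation of the four components can be illustrated with a short Python sketch. It assumes that each sub-score is the weighted sum of its location and variation component, that the weights sum to 1, and that the total score is the unweighted mean of the two sub-scores; the exact definitions are those of Herrmann et al. 23, and the function names and input values below are hypothetical.

```python
def sub_score(location, variation, gamma_loc=0.5, gamma_var=0.5):
    """Weighted combination of a location and a variation component in [0, 1]."""
    if abs(gamma_loc + gamma_var - 1.0) > 1e-12:
        raise ValueError("weights are assumed to sum to 1 in this sketch")
    return gamma_loc * location + gamma_var * variation


def conditional_score(e_cp, v_cp, e_cn, v_cn, gamma_loc=0.5, gamma_var=0.5):
    """Total score as the mean of the power and sample size sub-scores (assumption)."""
    scp = sub_score(e_cp, v_cp, gamma_loc, gamma_var)
    scn = sub_score(e_cn, v_cn, gamma_loc, gamma_var)
    return 0.5 * (scp + scn)


# Hypothetical component values, not taken from the paper's tables.
cs_example = conditional_score(e_cp=0.8, v_cp=0.6, e_cn=0.4, v_cn=1.0)
```

With equal weights, a rule with no variation in sample size (variation component 1) can compensate for a weaker location component, which is relevant when comparing against a group sequential design.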

| Resampling approach for sample size recalculation
To incorporate the variability of the interim effect, resampling, as a tool to assess the variability of a random variable, may be an option worth considering. The resampling approach is only performed if the observed interim test statistic falls within the recalculation area, that is, if a second stage is suggested. In this case, we resample $B$ test statistics from a normal distribution with the observed interim test statistic as mean and a standard deviation of 1. All resampled test statistics, including the ones that do not fall into the RA, enter the computation of the final value of the second stage's sample size as follows: For each of the $B$ resampled test statistics, the second-stage sample size is recalculated, resulting in a set of sample sizes $\tilde{n}^{(*)}_1, \dots, \tilde{n}^{(*)}_B$, where $(*)$ denotes the index for the initial sample size recalculation rule. Note that some of the "recalculated" sample sizes may thus correspond to the first-stage sample size $n_1$. Finally, a summary location measure combining all $B$ sample sizes determines the final value of the second-stage sample size. Here, we distinguish between two approaches:

1. The simplest approach is to define the second-stage sample size as the mean of all resampled sample sizes,

$$n^{(*)}_{R1} = \frac{1}{B} \sum_{b=1}^{B} \tilde{n}^{(*)}_b. \tag{12}$$

We denote this resampling method as the R1 approach.
2. Since the first stage's sample size has a large influence on the resampled sample size, an alternative option is considered where we use the mean plus the standard deviation of the resampled sample sizes to obtain the final value for the second-stage sample size,

$$n^{(*)}_{R2} = \frac{1}{B} \sum_{b=1}^{B} \tilde{n}^{(*)}_b + SD\left( \tilde{n}^{(*)}_1, \dots, \tilde{n}^{(*)}_B \right). \tag{13}$$

The standard deviation as an additional term implies that higher sample sizes are chosen. We call this method the R2 approach. Note that instead of adding the standard deviation, in principle also other measures of the resampled sample size distribution (e.g., predefined quantiles) could be used to achieve this effect. Therefore, the R2 approach is just to be seen as one exemplary option.

Figure 1 shows the recalculated sample sizes for the observed conditional power, restricted observed conditional power, and promising zone approach for the original recalculation rules (gray lines), the recalculation rules combined with the resampling method R1 (blue lines), and the recalculation rules combined with the resampling method R2 (green lines). The first stage's sample size was set to 50, the maximal sample size was limited to 200, the futility stop bound $\alpha_0$ was set to 0.5, and local significance levels were adjusted according to Pocock 17 to maintain the global significance level $\alpha = 0.025$. Note that the resampling approaches converge to a smooth line if the number $B$ of resampling samples tends to infinity. Here, every single point of the blue and green lines in the figure is based on $B = 5,000$ resampled sample sizes for the respective observed interim test statistic.
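The two resampling variants can be sketched in Python, here combined with the OCP rule. Several points are assumptions of this sketch rather than statements of the paper: the rule is applied to the total sample size per group (resampled statistics outside the RA contribute $n_1$), the sample standard deviation is used for R2, the R2 value is capped at $n_{max}$, no rounding to integers is applied, and the function names and the seed are hypothetical.

```python
import random
from math import ceil, inf, sqrt
from statistics import NormalDist, mean, stdev

PHI = NormalDist()
N1, N_MAX, POWER = 50, 200, 0.8
W1 = W2 = sqrt(50)
Q_FUT = PHI.inv_cdf(1 - 0.5)
Q_EFF = Q_FINAL = PHI.inv_cdf(1 - 0.0147)


def n_ocp(t1):
    """OCP rule on the total sample size per group (closed-form sketch)."""
    if not (Q_FUT <= t1 < Q_EFF):
        return N1
    need = (Q_FINAL * sqrt(W1**2 + W2**2) - W1 * t1) / W2 + PHI.inv_cdf(POWER)
    if need <= 0:
        return N1
    n = inf if t1 <= 0 else N1 + ceil(2 * (need / (t1 * sqrt(2 / N1))) ** 2)
    return min(n, N_MAX)


def resampled_sample_sizes(t1_obs, rule, B=5000, seed=1):
    """Return (R1, R2): mean and mean plus SD of the resampled sample sizes."""
    rng = random.Random(seed)
    # Draw B test statistics from N(t1_obs, 1) and apply the rule to each draw.
    sizes = [rule(rng.gauss(t1_obs, 1.0)) for _ in range(B)]
    r1 = mean(sizes)
    r2 = min(r1 + stdev(sizes), N_MAX)  # cap at n_max (assumption)
    return r1, r2


r1, r2 = resampled_sample_sizes(1.0, n_ocp, B=2000)
```

Any other recalculation rule with the same signature can be passed as `rule`, which mirrors the statement that the resampling tool can be combined with every existing recalculation rule.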

| SIMULATION STUDY
To evaluate the performance of the different sample size recalculation approaches presented above, we conducted a simulation study 24 with the software R. 25 Random numbers were generated by the function rnorm. The resulting approaches were evaluated by means of specific performance characteristics and by the new conditional performance score (10) with parameter settings γ loc = γ var = 0.5.

| Simulation setup
For the simulation study, we rely on the design characteristics described in Section 2. Thus, we chose equally sized groups of $n_1 = 50$ subjects in the first stage. The initial second-stage sample size per group was set to $n_2 = 50$, hence leading to an initial overall sample size per group of $n_{ini} = n_1 + n_2 = 100$, and the maximum possible sample size per group was set to four times the interim sample size $n_1$, that is, $n_{max} = 200$. Accordingly, the weights for the inverse normal combination test were selected as $w_1 = w_2 = \sqrt{50}$. The global one-sided significance level was given by $\alpha = 0.025$ and the local significance levels were calculated according to Pocock, 17 that is, $\alpha_1 = \alpha_{1+2} = 0.0147$. Moreover, the futility bound was set to $\alpha_0 = 0.5$. The desired conditional power was chosen to be $1 - \beta = 0.8$. The lower bound for the conditional power in the restricted observed conditional power approach (ROCP) was fixed to $1 - \beta^{ROCP}_{low} = 0.6$. The lower bound for the promising zone (PZ) was set to $1 - \beta^{PZ}_{low} = 0.36$, as applied by Mehta and Pocock. 8 We investigated the performance of the different designs for a variety of underlying true standardized treatment effects $\Delta \in \{0.0, 0.1, 0.2, 0.3, 0.4, 0.5\}$. For each scenario, 10,000 simulation samples were drawn. The number of samples for the resampling approaches was set to $B = 5,000$. For the sake of comparison, a group sequential design (GS) with $n_1 = n_2 = 50$ and the same decision boundaries as given above was also simulated and evaluated.
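The conditional evaluation of one scenario can be sketched as a small Monte Carlo loop. The original study was carried out in R; the Python code below is only an illustration under simplifying assumptions: the interim statistic is drawn directly from its normal approximation $N(\Delta \sqrt{n_1/2}, 1)$ instead of simulating patient-level data, only the OCP rule without resampling is evaluated, and all names and the seed are hypothetical.

```python
import random
from math import ceil, inf, sqrt
from statistics import NormalDist, mean, pvariance

PHI = NormalDist()
N1, N_MAX, POWER = 50, 200, 0.8
W1 = W2 = sqrt(50)
Q_FUT = PHI.inv_cdf(1 - 0.5)
Q_EFF = Q_FINAL = PHI.inv_cdf(1 - 0.0147)


def threshold(t1):
    return (Q_FINAL * sqrt(W1**2 + W2**2) - W1 * t1) / W2


def n_ocp(t1):
    """OCP rule for t1 inside the recalculation area."""
    need = threshold(t1) + PHI.inv_cdf(POWER)
    if need <= 0:
        return N1
    n = inf if t1 <= 0 else N1 + ceil(2 * (need / (t1 * sqrt(2 / N1))) ** 2)
    return min(n, N_MAX)


def simulate_conditional(delta, rule, n_sim=10000, seed=1):
    """Location and variation of conditional sample size and power under delta."""
    rng = random.Random(seed)
    sizes, powers = [], []
    for _ in range(n_sim):
        t1 = rng.gauss(delta * sqrt(N1 / 2), 1.0)
        if Q_FUT <= t1 < Q_EFF:           # keep only trials entering stage two
            n = rule(t1)
            sizes.append(n)
            # Conditional power evaluated at the true effect delta.
            powers.append(1 - PHI.cdf(threshold(t1) - delta * sqrt((n - N1) / 2)))
    return {
        "share_in_ra": len(sizes) / n_sim,
        "mean_n": mean(sizes), "var_n": pvariance(sizes),
        "mean_cp": mean(powers), "var_cp": pvariance(powers),
    }


result = simulate_conditional(0.3, n_ocp, n_sim=4000)
```

The returned means and variances correspond to the location and variation measures that enter the conditional performance score.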

| Simulation results
Detailed simulation results can be found in Tables A1-A3 in the Appendix. In the following, we only present and discuss the main performance results with respect to the conditional performance score which are shown in Table 1.
In the standard sample size recalculation scenario without resampling, the group sequential study design is the clear performance winner except for a true underlying effect of Δ = 0.3. This is mainly due to no variation in recalculated sample sizes. The first resampling approach with the mean as summary location measure (R1 approach) performs better than the respective standard sample size recalculation rules without resampling with respect to the conditional performance score for all considered true standardized effect sizes Δ (Table 1, Columns 3 and 4). This is particularly due to the fact that the resampling approach reduces the variability in the recalculated sample sizes for all Δ. Moreover, for the restricted observed conditional power and promising zone approach, the R1 approach either performs even better than the group sequential approach or attains a very similar total conditional performance score (Table 1, Columns 3 and 4). This stems mainly from a better performance with respect to the conditional power (Tables A1 and A2 in the Appendix). The performance of the observed conditional power approach is in most cases worse than that of the group sequential approach, especially due to a worse conditional sample size performance (Table A2 in the Appendix). Moreover, the sample size recalculation approaches with resampling smooth the sample size curves and have the tendency to mimic the sample size shape of the group sequential design (cf. Figure 1), which is given by a horizontal line at n = n 1 + n 2 = 100 within the recalculation area.

FIGURE 1 Sample size recalculation rules as functions of the observed interim test statistic (recovered panel title: Promising zone approach). The gray solid lines present the original recalculation rules, the blue solid lines describe the resampling approach with the mean as summary location measure (R1 approach), and the green solid lines describe the resampling approach with the mean plus standard deviation as summary location measure (R2 approach).
The shape of the sample size recalculation curve of the promising zone approach with resampling is here closest to the group sequential approach.
Note that there is a trend towards the first stage's sample size n 1 for the sample sizes recalculated with the R1 approach. This is due to the fact that test statistics outside the recalculation area might be resampled even if the observed interim test statistics falls within RA. To overcome this issue, resampling approach R2 is based on a different summary location measure which is given by the mean of the resampled sample sizes plus its standard deviation.
As it can be seen from the green lines in Figure 1, the observed conditional power approach with the R2 approach now has a stronger trend towards a group sequential design due to the sample size boundary n max . Compared to the observed conditional power approach with resampling and the mean as summary location measure (R1 approach, blue lines), the step size at the upper edge of the recalculation area is higher. The restricted observed conditional power R2 approach looks similar to the restricted observed conditional power R1 approach (cf. Figure 1) but is located at a higher position. For the promising zone approach combined with the R2 approach, the shape of the sample size curve is now nearly horizontal and thus looks again very similar to a group sequential design with constant sample sizes per stage. Within the R2 method, the promising zone has the best overall conditional performance scores for most values of Δ (Table 1, Column 5) compared to the observed conditional and restricted observed conditional power R2 approaches. Overall, the R2 approaches perform again better than the original sample size recalculation approaches without resampling (Table 1, Columns 3 and 5). This is again mainly due to a reduction in variability of the conditional sample size (Tables A1 and A2 in the Appendix) since also here the resampling leads to more robust approaches. However, the group sequential approach outperforms the different R2 methods (

| Clinical trial example
Based on a clinical trial of Bowden and Mander, 26 we consider a new and a standard treatment, N and S, for osteoarthritis patients with respect to pain relief after 2 weeks compared to baseline. Pain relief is measured on the McGill pain scale, 27 where values range from 0 (no pain) to 50 (maximally possible pain). The values are supposed to be normally distributed. For the sake of illustration, we adapt the initial clinical trial design to meet the design requirements of the methods proposed here, as already done by Herrmann and Rauch. 28 We assume that superiority of the new treatment is known from a pilot study but further evidence is required to quantify the effect size. Therefore, the hypotheses are given by

$$H_0: \left( \mu^N_{baseline} - \mu^N_{2weeks} \right) - \left( \mu^S_{baseline} - \mu^S_{2weeks} \right) \le 0 \quad \text{versus} \quad H_1: \left( \mu^N_{baseline} - \mu^N_{2weeks} \right) - \left( \mu^S_{baseline} - \mu^S_{2weeks} \right) > 0,$$

where $\left( \mu^N_{baseline} - \mu^N_{2weeks} \right)$ describes the expected pain relief after 2 weeks for the new treatment and $\left( \mu^S_{baseline} - \mu^S_{2weeks} \right)$ for the standard treatment. The study is evaluated with an adaptive two-stage design and the possibility to recalculate the sample size at the interim analysis. More precisely, we choose $n_1 = n_2 = 50$ and $n_{max} = 4 \cdot n_1 = 200$ and apply the inverse normal combination test 11 with weights $w_1 = w_2 = \sqrt{50}$. Furthermore, we decide on a binding futility stop bound $\alpha_0 = 0.5$, a global significance level $\alpha = 0.025$, and local significance levels adjusted according to Pocock. 17 Suppose we observe an interim effect size Δ = 0.2, referring to an interim test statistic of $T_1 = 1$, and we are interested in the conditional performance differences of the OCP, ROCP, and PZ approaches with and without the R1 resampling approach. For the evaluation, we use an equal weighting of the conditional performance score components, as a large recalculated sample size is only justifiable if it is not caused by random variation and if the sample size meets the target value at the same time. Therefore, variation and location components are considered as equally important.
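The correspondence between the observed interim effect and the interim test statistic stated above follows from the relation between the observed standardized effect and the first-stage statistic, $\hat{\Delta} = t_1 \cdot \sqrt{2/n_1}$, evaluated at $n_1 = 50$:

$$t_1 = \hat{\Delta} \cdot \sqrt{\frac{n_1}{2}} = 0.2 \cdot \sqrt{25} = 0.2 \cdot 5 = 1.$$

This value lies inside the recalculation area, whose lower bound is the 50% quantile of the standard normal distribution, that is, 0, for the futility bound $\alpha_0 = 0.5$.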
We primarily focus on the performance for Δ = 0.2, which corresponds to the observed effect size, but also take the performance of the neighboring effect sizes Δ = 0.1 and 0.3 into account. The performance values are given in Table 1 and Tables A1 and A2. Without resampling and an interim effect size of Δ = 0.2, the OCP approach would suggest the maximal sample size of 200, whereas the ROCP approach suggests no increase of the sample size at all and thus no second stage of the trial, and the PZ approach suggests the initially planned total sample size of 100. The resampling R1 approach suggests for all three approaches (OCP, ROCP, PZ) a trial continuation with total sample sizes varying between at least 75 and at most 150. The overall conditional performance measured by the conditional performance score is better for the R1 approach than for the original approach for all three recalculation rules and all three considered effect sizes (cf. Table 1). This comes mainly from the variance reduction of the conditional sample size and power by the resampling approach (cf. Tables A1 and A2). For Δ = 0.1 and 0.2, the ROCP R1 resampling approach performs best, whereas for Δ = 0.3, the OCP R1 resampling approach turns out to be the best (cf. Table 1). This change in the ranking is mainly due to the change in underlying target values for an effect of 0.3. If one is also interested in the global performance, the OCP R1 approach attains a higher global power than the ROCP and PZ R1 approaches across the considered effect sizes due to higher sample sizes. As a general result it can be deduced that the resampling approaches flatten the shape of the sample size function and thus reduce the variability.

| DISCUSSION AND CONCLUSIONS
Incorporating resampling to the interim test statistics in established sample size recalculation rules leads to more robust recalculation approaches with a considerable performance improvement with respect to individual performance characteristics and a conditional performance score, mainly due to the reduced variance in the conditional sample size and conditional power. This was also seen in a fictitious clinical trial example. Note the weighting scheme of the conditional performance score and its reference values might also be chosen differently. Moreover, note that the observed performance jumps around Δ = 0.3 for the conditional performance score are a general property of recalculation rules as for small effects no increase in sample size is favorable whereas from a certain medium effect an increase in sample size is reasonable. Irrespective of the performance score, the application of the proposed resampling tool resulted in a smoothing of the sample size curve. The form of the smoothed sample size function is concave where the kurtosis nearly vanishes for some scenarios. Thus, the sample size function approaches a constant line in some situations which in turn mimics a group sequential design. The concave form of the smoothed function means that within a certain interval of the recalculation area, the sample size increases with increasing observed interim effect. One might argue that an increase in sample size with increasing interim test statistic is not reasonable. However, despite their unintuitive character, concave sample size functions have shown to be optimal in some particular settings. 29 Furthermore, one could also argue that large "jumps" in the sample size function are also not reasonable as this implies that the sample size changes considerably if the observed test statistic is only minimally changed. 
Hence, they can be seen as two opposite points of view: On the one hand unintuitive "jumps" in sample size can be avoided with smoothed sample size curves by concave function shapes, and on the other hand this results in sample size functions which are no longer monotonically decreasing in the interim test statistic, which also is not intuitive. Note that these "jumps" are part of nearly all established sample size recalculation rules. In areas where conventional recalculation rules show these "jumps", the resampling approach defines a compromise between the extremes. One might also say that any sample size recalculation rule that includes large "jumps" is generally not reasonable and, as a consequence, the compromise proposed by the resampling approach cannot be optimal either. A general recommendation is thus to choose the design settings such that large jumps are omitted, for example, through a smaller maximal sample size n max or a larger local significance level α 1 + 2 . Even though the resampling approaches outperform the original sample size recalculation rules with respect to the conditional performance score, it does not mean that the resulting sample sizes are point-wise optimal. It rather reduces the average risk of choosing an entirely wrong sample size, which leads on average to good results. In the individual case, however, this can be fundamentally wrong. The latter is of course not a negative feature for the resampling approach but generally holds true for sample size recalculation rules. We believe that sample size recalculation rules with resampling are a good approach to take account of the cost-benefit ratio. Due to the characteristic of reducing the average distance to the ideal sample size, the method is suitable to balance the costs and the benefits of a study by choosing the best trade-off between both.
The visual similarity between the procedures with resampling and group sequential designs is remarkable. In particular, the promising zone approach combined with resampling approximates a standard group sequential design. This is because the promising zone approach includes large sample size jumps in a very small range of observed interim effects, and this small range of large sample sizes has a low impact on the smoothed sample size curve. This further supports the thesis that group sequential designs might have an exceptional position among designs with sample size recalculation (cf. also References 12,20,21). However, while in group sequential designs the sample size depends on the interim test statistic only through the decision whether or not to stop the trial early, incorporating resampling into sample size recalculation rules makes it possible to base sample size recalculation on conditional power considerations and still avoid severe fluctuations in sample size. Hence, resampling makes sample size recalculation rules more robust and explicitly addresses the randomness of the observed interim test statistic. Combined with the simulation code supplied, this offers an appealing way of improving sample size recalculation rules in adaptive study designs.
Note that the resampling approaches described in formulas (12) and (13) may also be applied to studies with other types of endpoints as long as the test statistics are approximately normally distributed. The application of the resampling approach to binary endpoints is straightforward using the normal approximation test for rates. For time-to-event endpoints, the logrank test can also be applied in an adaptive design setting. 30 As an alternative to the resampling approach proposed here, a more direct approach to improve the performance of sample size recalculation could be to define a sample size recalculation function that optimizes the conditional performance score. This idea can be based on a numerical constrained optimization framework. The implementation of this alternative approach and its comparison to the resampling approach will be the task of future work.

Formulas for the conditional performance score

In the following, the formulas for the conditional performance score together with its sub-scores are presented. More details on the motivation of the score can be found in Herrmann et al. 23 First, we describe the four components ($e_{CN}$, $v_{CN}$, $e_{CP}$, $v_{CP}$) and the two sub-scores (SCN, SCP), which can be combined to define the total score CS. The underlying idea of the two location components, $e_{CP}$ and $e_{CN}$, is to compare the evaluated average conditional power and average conditional sample size with predefined target values. We specified the target values for the sample size as

$$
CN_{\text{target}} := \begin{cases} n_{\text{fix}}^{\Delta}, & \text{if } n_{\text{fix}}^{\Delta} \le n_{\max} \text{ and } \Delta \neq 0, \\ n_1, & \text{if } n_{\text{fix}}^{\Delta} > n_{\max} \text{ or } \Delta = 0, \end{cases}
$$

where $n_{\text{fix}}^{\Delta}$ refers to the required sample size in a fixed study design. The target values for the conditional power are given as

TABLE A1 Performance summary for the sample size recalculation rules without resampling for the design settings described in Section 3.1. Numbers in brackets below the standardized treatment effects represent the target values for the sample size.
Performance measure abbreviations are given at the beginning of Section "Detailed performance evaluation". Abbreviations: OCP, observed conditional power approach; ROCP, restricted observed conditional power approach; PZ, promising zone approach.
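The target value for the sample size defined in the appendix above can be written as a short function. The following Python code is a sketch under assumptions that are not stated in this section: the fixed-design sample size is taken to be the per-group sample size of a one-sided two-sample z-test with unit variance, rounded up to the next integer, and the function names `n_fix` and `cn_target` as well as the default values for α and β are hypothetical. Only nonnegative standardized effects Δ are considered.

```python
import math
from statistics import NormalDist


def n_fix(delta, alpha=0.025, beta=0.2):
    """Per-group sample size of a fixed design (assumed: one-sided two-sample z-test)."""
    z = NormalDist().inv_cdf
    return math.ceil(2 * (z(1 - alpha) + z(1 - beta)) ** 2 / delta ** 2)


def cn_target(delta, n1, n_max, alpha=0.025, beta=0.2):
    """Target sample size: the fixed-design size if attainable, otherwise n1."""
    if delta == 0:
        return n1                    # no effect: stopping after stage one is the target
    n = n_fix(delta, alpha, beta)
    return n if n <= n_max else n1   # effect too small to be powered within n_max
```

For example, with `n1 = 50` and `n_max = 200`, `cn_target(0.5, 50, 200)` returns the fixed-design sample size, whereas `cn_target(0.1, 50, 200)` falls back to `n1` because the fixed-design sample size exceeds the maximum.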