An alternative metric for evaluating the potential patient benefit of response‐adaptive randomization procedures

When planning a two-arm group sequential clinical trial with a binary primary outcome that has severe implications for quality of life (e.g., mortality), investigators may strive to find the design that maximizes in-trial patient benefit. In such cases, Bayesian response-adaptive randomization (BRAR) is often considered because it can alter the allocation ratio throughout the trial in favor of the treatment that is currently performing better. Although previous studies have recommended fixed randomization over BRAR based on patient benefit metrics calculated from the realized trial sample size, these comparisons have been limited by failures to hold type I and II error rates constant across designs or to consider the impacts on all individuals directly affected by the design choice. In this paper, we propose a metric for comparing designs with the same type I and II error rates that reflects expected outcomes among the individuals who would participate in the trial if enrollment is open when they become eligible. We demonstrate how to use the proposed metric to guide the choice of design in the context of two recent trials in persons suffering out-of-hospital cardiac arrest. Using computer simulation, we demonstrate that various implementations of group sequential BRAR offer modest improvements with respect to the proposed metric relative to conventional group sequential monitoring alone.


INTRODUCTION
Randomized controlled trials (RCTs) are widely regarded as the gold standard for studying the efficacy of new treatments. Randomly assigning participants to either treatment or control precludes selection bias and, on average, balances groups with respect to both known and unknown confounders so that unbiased treatment effect estimates can be obtained (Altman and Bland, 1999; Lee et al., 2012). Two-arm RCTs traditionally allocate participants to treatments using 1:1 randomization throughout the course of the trial to obtain unbiased comparisons and high power (Lee et al., 2012; Thall et al., 2015). However, physicians and potential participants often find it appealing when a trial allows for a higher probability of being randomized to the better performing arm as the trial data accrue (Palmer, 1993; Palmer and Rosenberger, 1999). For this reason, response-adaptive randomization (RAR), which alters the allocation ratio based on accruing data in favor of the empirically superior treatment, has garnered considerable attention in Phase II clinical trials (Thall and Wathen, 2007; Berry et al., 2011; Pallmann et al., 2018).

This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited. © 2022 The Authors. Biometrics published by Wiley Periodicals LLC on behalf of International Biometric Society.
A recent example of a trial using a Bayesian RAR design is the Advanced REperfusion STrategies for Refractory Cardiac Arrest (ARREST) trial (ClinicalTrials.gov, NCT03880565), which assessed the impact of extracorporeal membrane oxygenation-facilitated resuscitation versus standard advanced cardiac life support on survival to hospital discharge in adults who experienced an out-of-hospital cardiac arrest and refractory ventricular fibrillation (Yannopoulos et al., 2020a, 2020b). Another example is the ACCESS trial (ClinicalTrials.gov, NCT03119571) in adults who were resuscitated from an out-of-hospital ventricular fibrillation or ventricular tachycardia cardiac arrest, which evaluated whether being admitted directly to the cardiac catheterization laboratory versus the intensive care unit improved survival to hospital discharge with no more than moderate disability, including the ability to walk without assistance (modified Rankin scale score ≤ 3) (Yannopoulos and Aufderheide, 2021).
Bayesian RAR is most easily conceptualized in the context of a two-arm trial with binary outcomes. Let π_E and π_C, respectively, denote the true, but unknown, probabilities of response under the treatment (E) and control (C), let X denote the available data from the first n participants, and let p_> = Pr(π_E > π_C | X) denote the posterior probability that the probability of a response is greater under the treatment than the control. Thompson (1933) first proposed randomizing the (n + 1)st participant to the treatment with probability p_> and to the control with the analogous posterior probability that the control is superior, which is equivalent to 1 − p_> in the two-arm setting. When the treatment is efficacious, this sampling method, along with several modifications to Thompson sampling that have since been proposed, gives participants a greater chance of receiving the treatment during the trial (Thall and Wathen, 2007; Wathen and Thall, 2017). Yet, Bayesian RAR remains contentious and has seen limited use in practice. It is commonly criticized for inducing biased treatment effect estimates, type I error inflation in the presence of trial population drift or effect heterogeneity over time, reduced power relative to 1:1 allocation, and nonnegligible probabilities of allocating more participants to the inferior treatment (Hey and Kimmelman, 2015; Korn and Freidlin, 2010; Thall et al., 2015, 2016). When planning a Phase II trial, investigators often seek a design with specified type I and II error rates that is ethically, administratively, and economically feasible. For these reasons, a group sequential design is frequently used to facilitate early stopping for efficacy, harm, or futility at prespecified interim sample sizes throughout the trial, which protects participants from unnecessary exposure to ineffective or harmful treatments and allows limited resources to be allocated to other trials (Jennison and Turnbull, 2000). RAR may be considered when investigators wish to maximize in-trial patient benefit, or equivalently minimize in-trial patient harm.
Over the last several decades, researchers have studied the potential patient benefits of combining group sequential and RAR methodology to limit exposure to an inferior arm. For example, Yao and Wei (1996) extended the randomized play-the-winner (RPW) rule to a multistage setting and found that RAR provides ethical benefits to study participants while maintaining adequate power. Morgan and Coad (2007) investigated other group sequential adaptive sampling rules, including the drop-the-loser, randomized Polya urn, and sequential maximum likelihood estimation rules, and observed useful reductions in the number of treatment failures. Coad and Rosenberger (1999) further found that combining the RPW rule and triangular test for clinical trials with binary responses is an effective way to reduce the expected number of treatment failures and suggested extending their procedure to a multistage adaptive design similar to Yao and Wei (1996).
In this paper, we propose a metric for comparing group sequential designs based on the cohort most acutely impacted by the choice of design and illustrate how this metric may be applied to select a design in the ARREST and ACCESS contexts. RAR designs are commonly compared using inferential and estimation metrics (e.g., type I error, power, and bias) rather than measures of patient benefit, which remain underreported and have received little attention in the RAR literature (Robertson et al., 2020). This is in part because existing patient benefit metrics, including the expected number of trial failures, the proportion of patients assigned to the inferior arm, and the probability of a treatment imbalance in the wrong direction, are often limited by failures to hold type I and II error rates constant or to account for the different sample size requirements of the designs under consideration (Karrison et al., 2003; Morgan and Coad, 2007; Zhu and Hu, 2010; Robertson et al., 2020). One approach to correct for the latter issue is to compare designs with respect to the expected number of failures within a finite patient horizon (Villar et al., 2015b). However, as far as we are aware, no specific guidance exists for selecting an appropriate horizon, and there is a need, as suggested by Robertson et al. (2020), for patient benefit metrics that clearly quantify the ethical properties of RAR designs while considering patient benefit both within and outside of a trial. Our proposed metric improves on existing patient benefit metrics by considering a set of feasible group sequential designs with equal type I and II error rates and measuring the expected number of failures in the fixed group of individuals who are directly impacted by the design choice, namely, those who would participate in the trial if enrollment were open when they become eligible. This paper is organized as follows. In Section 2, we introduce our proposed metric.
In Section 3, we discuss the underpinnings of Bayesian RAR group sequential designs and the probability model and randomization scheme used in our particular implementation. We consider six different variations of Bayesian RAR, including two modifications to Thompson sampling and Bayesian versions of Neyman allocation, the optimal allocation of Rosenberger et al. (2001), the doubly adaptive biased coin design (DABCD) of Hu and Zhang (2004), and the efficient randomized adaptive design (ERADE) of Hu et al. (2009). In Section 4, we outline our simulation study and provide results in the context of the ARREST and ACCESS trials. Although these results are in terms of a two-arm trial with a binary primary outcome and Bayesian RAR, our metric may be directly applied to contexts comparing multiple treatments or other RAR procedures. We conclude by discussing how our metric may be applied to select a design.

DESIGN COMPARISON METRIC
Our proposed metric measures the expected number of failures in the cohort that is directly impacted by the design choice for a set of practically feasible designs with equal type I and II error rates. We define the size of this cohort as the maximum sample size among all designs compared and call this group the potential study sample, as this cohort comprises all the individuals who will participate in the trial if it is open when they become eligible to enroll. Let D = (d_1, …, d_J) denote a set of J designs satisfying budgetary, recruitment, or logistical constraints and achieving a desired significance level, α, and power, 1 − β, to detect the hypothesized treatment effect. In Section 4, we discuss this set in the context of the ARREST and ACCESS trials. Let N_max,j denote the maximum sample size for design d_j that provides the specified error rates. Then, the size of the potential study sample may be defined as N_max = max_{j=1,…,J} N_max,j. For a particular design d_j, suppose that n_j is the sample size of the trial at the time the trial is stopped, and A_j = (a_1, …, a_{n_j}) and Y_j = (y_1, …, y_{n_j}) are the randomization assignments and observed outcomes for trial participants i = 1, …, n_j, respectively, where a_i ∈ {E, C} and y_i ∈ {0, 1}. Conditional on the observed data X_j = (A_j, Y_j), we define the number of failures within the potential study sample as

F(X_j, N_max, π_E, π_C) = Σ_{i=1}^{n_j} (1 − y_i) + (N_max − n_j)[δ_j(X_j)(1 − π_E) + {1 − δ_j(X_j)}(1 − π_C)],  (1)

where δ_j(X_j) = I{reject H_0} is an indicator for whether the treatment is declared efficacious in design d_j. That is, (1) reflects the number of failures in the actual trial plus the expected number of failures among the remaining individuals in the potential study sample under the arm the trial recommends. Because the observed data and decision rule depend on the design, we define the proposed metric as

Ψ_j = E[L{F(X_j, N_max, π_E, π_C)}]
    = Σ_{A_j, Y_j} L{F(X_j, N_max, π_E, π_C)} Π_{i=1}^{n_j} Pr(a_i | a_1, …, a_{i−1}, y_1, …, y_{i−1}) Pr(y_i | a_i),  (2)

for some loss function L(⋅). We focus on L(F) = F; however, other loss functions may be used to prioritize different objectives.
For example, using L(F) = F², analogous to the traditional L2 (squared-error) loss, would more heavily penalize designs that have a greater chance of yielding many failures in the potential study sample. Alternatively, the median rather than the mean loss could be used to compare designs while limiting the impact of outliers. The second line in (2) emphasizes that designs under consideration may use RAR, in which case each assignment depends on all previous assignments and outcomes, whereas each outcome depends only on its own assignment.
We approximate Ψ_j using Monte Carlo integration; that is, we iteratively simulate A_j and Y_j under design d_j, calculate L{F(X_j, N_max, π_E, π_C)} for each realization of the data, and then take the average. We make the working assumption that participants who would have enrolled in the trial had it continued will receive the treatment when it is shown to be superior and the control otherwise. This approach provides a way to reward a design that stops early to recommend the treatment when it is, in fact, efficacious. The design d_{j*} with j* = argmin_j Ψ_j is optimal with respect to the proposed metric among the set of designs under consideration that provide the same type I and II error rates.
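To make this Monte Carlo approximation concrete, the sketch below computes the failure count in (1) for one simulated trial and averages a loss over many replicates. It is written in Python for illustration (our simulations used R), all function names are our own illustrative choices, and the caller supplies the trial-simulation routine.

```python
import random

def failures_in_potential_sample(outcomes, rejected, n_max, pi_E, pi_C):
    """Equation (1): observed trial failures plus the expected failures
    among the remaining (n_max - n) individuals under the recommended arm."""
    n = len(outcomes)
    trial_failures = sum(1 - y for y in outcomes)  # y = 1 is a response
    pi_rec = pi_E if rejected else pi_C  # response rate of the recommended arm
    return trial_failures + (n_max - n) * (1 - pi_rec)

def estimate_metric(simulate_trial, n_max, pi_E, pi_C,
                    n_sim=10_000, loss=lambda f: f, seed=1):
    """Monte Carlo approximation of the proposed metric E[L(F)].
    simulate_trial(rng) must return (outcomes, rejected) for one trial."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sim):
        outcomes, rejected = simulate_trial(rng)
        total += loss(failures_in_potential_sample(
            outcomes, rejected, n_max, pi_E, pi_C))
    return total / n_sim
```

The identity loss reproduces the expected number of failures; passing, e.g., `loss=lambda f: f**2` implements the squared-error variant discussed above.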
Our approach to finding the optimal group sequential design is similar to the horizon problem initially posed by Anscombe (1963), which seeks a design that minimizes the expected number of patients receiving the inferior treatment within the trial and in the future. Our potential study sample may be viewed as a fixed patient horizon that prioritizes individuals who are implicitly impacted by the decision to end a trial early or proceed with a smaller design. Larger patient horizons could have been considered; however, because we only consider designs with the same type I and II error rates, the long-run impact of these designs is similar, with each recommending the treatment at the same rates under both the null and alternative hypotheses.

BAYESIAN RESPONSE-ADAPTIVE RANDOMIZATION GROUP SEQUENTIAL DESIGNS
RAR alters the allocation ratio throughout the trial based on accumulated data (i.e., past treatment assignments and patient outcomes) to achieve particular design objectives. In this paper, we consider various Bayesian RAR approaches that aim to maximize efficiency or reduce exposure to an inferior treatment. We implement each approach in a group sequential context that analyzes data at up to K interim analyses, k = 1, …, K, and allows a trial to be stopped early for efficacy or harm. When the trial continues, each Bayesian RAR approach modifies the randomization ratio for the next group of participants accordingly. Letting ρ_k denote the target randomization probability to the treatment arm for group k, we assume ρ_1 = 0.5 and thereafter modify ρ_k using data from the preceding interim analysis until the trial is stopped. We implement the Bayesian RAR designs using one probability model, four boundary shapes, eight adaptive modifications to the randomization probability, and one method for generating assignments under the targeted randomization probability.

Probability model
Two-arm Bayesian RAR designs with a binary primary outcome have conventionally been implemented using independent beta-binomial probability models for π_E and π_C. We instead employ a logistic regression probability model with weakly informative t-distribution priors. This model has been shown to have increased power and reduced type I error rate sensitivity to the underlying response probability compared to a beta-binomial modeling approach due to shrinkage that arises from placing less prior density on extreme values of the response probabilities (Proper et al., 2021). A plot of the prior distributions for the beta-binomial and logistic regression models on the probability and log-odds scales is provided in Web Appendix A, which indicates that the prior distribution for the beta-binomial model is asymmetric on the log-odds scale and places more mass on very small values of the response probability. Letting a = E and a = C denote the experimental treatment and control arms, respectively, the logistic regression probability model arises from a logit transformation of π_a:

logit(π_a) = β_0 + β_1 I(a = E),  (3)

and the priors are formally defined as

β_0 ~ t(ν, log{π_0/(1 − π_0)}, s_0),  β_1 ~ t(ν, 0, s_1),  (4)

where t(ν, m, s) denotes a generalized t-distribution with ν, m, and s representing the degrees of freedom, location, and scale parameters, respectively (Proper et al., 2021). The parameter ν advantageously allows investigators to induce statistical models that are robust to outliers by altering the kurtosis, or thickness of the tails, of the prior (Lange et al., 1989). It may also be used to control the influence of prior beliefs on statistical inference. As shown in Equation (4), we use a prior intercept location of log{π_0/(1 − π_0)}; this value represents an estimate of the intercept parameter under the hypothesized null scenario and is easily derived by setting π_E = π_C = π_0 and β_1 = 0 in Equation (3). Following Ghosh et al. (2015), we use a conditionally conjugate Polya-Gamma Gibbs sampler to jointly sample β_0 and β_1 from the posterior distribution.
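As a small illustration of the model in Equations (3) and (4), the Python snippet below (hypothetical helper names, not the paper's software) evaluates the response probability implied by the logistic model and the prior intercept location under a hypothesized null rate π_0:

```python
import math

def expit(x):
    """Inverse of the logit transformation."""
    return 1.0 / (1.0 + math.exp(-x))

def response_prob(beta0, beta1, arm):
    """Equation (3): logit(pi_a) = beta0 + beta1 * I(a = E)."""
    return expit(beta0 + beta1 * (arm == "E"))

def prior_intercept_location(pi0):
    """Equation (4): prior location for the intercept under the null rate pi0,
    obtained by setting pi_E = pi_C = pi0 and beta1 = 0 in Equation (3)."""
    return math.log(pi0 / (1 - pi0))
```

For the ARREST null rate of 12%, centering the intercept prior at `prior_intercept_location(0.12)` with `beta1 = 0` recovers a 12% response probability in both arms.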

Thompson sampling
The first two adaptive modifications to the randomization probability that we consider are variations of Thompson sampling. We define p_> as the proportion of posterior samples where β_1 > 0, which signifies that the treatment arm has greater odds of response than the control. Although it is intuitively appealing to set ρ_k = p_>, the variability of p_> with small sample sizes may reduce power and exacerbate the nontrivial probability of a treatment imbalance in the wrong direction (Thall and Wathen, 2007). We implement two modifications proposed by Thall and Wathen that stabilize p_> prior to using it as a randomization probability (Thall and Wathen, 2007; Wathen and Thall, 2017). The first modification randomizes the first group using 1:1 allocation and sets ρ_k = min{0.75, max{0.25, p_>}} for k = 2, …, K. This restriction limits power loss and selection bias that may result from extreme allocation to one arm. The second modification adapts p_> as follows:

ρ_k = p_>^{c_k} / {p_>^{c_k} + (1 − p_>)^{c_k}},  (5)

where c_k is a nonnegative tuning parameter equal to 0 for k = 1 and 0.5 ⋅ (k − 1)/(K − 1) for k = 2, …, K. Thall and Wathen (2007) originally proposed the use of a positive tuning parameter in a fully sequential trial; however, our expression in (5) is an adaptation of their proposal to a group sequential context. This modification is conservative in the beginning of the trial when c_k constrains ρ_k near 0.5 (e.g., when k = 1, c_1 = 0 and hence ρ_1 = 0.5), but becomes more aggressive later on as c_k incrementally approaches 0.5. A plot of the ρ_k values arising from these two adaptations for various p_> is provided in Web Appendix B.
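The two stabilizing modifications can be sketched as follows (Python for illustration; function names are our own, with the 0.25/0.75 clipping limits and the tuning schedule taken from the text):

```python
def restricted_thompson(p_gt):
    """First modification: clip the posterior probability p_> to [0.25, 0.75]."""
    return min(0.75, max(0.25, p_gt))

def tuned_thompson(p_gt, k, K):
    """Second modification, Equation (5): rho_k = p^c / {p^c + (1-p)^c},
    with c_k = 0 at the first analysis and 0.5*(k-1)/(K-1) thereafter."""
    c = 0.0 if k == 1 else 0.5 * (k - 1) / (K - 1)
    num = p_gt ** c
    return num / (num + (1.0 - p_gt) ** c)
```

Note that `tuned_thompson` returns exactly 0.5 at the first analysis regardless of p_>, and drifts toward the unmodified p_> as k grows, matching the conservative-then-aggressive behavior described above.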

Optimal allocations
We consider two optimal allocation ratios for two-arm clinical trials with a binary outcome. The first is Neyman allocation, which maximizes power by minimizing the variance of the test statistic (Melfi and Page, 1998). It is important to note that when π_E + π_C > 1, Neyman allocation undesirably allocates more participants to the arm with the lower response probability in order to maximize power. The second is the optimal allocation ratio proposed by Rosenberger et al. (2001) that, for a fixed variance of the test statistic, minimizes the expected number of nonresponders. We herein refer to this procedure as RSIHR allocation. For a fixed number of trial participants, n = n_E + n_C, and assuming the difference in sample proportions is the test statistic of interest, one may show that the optimality criterion for Neyman allocation is satisfied by allocating the following proportion of participants to the treatment arm (Melfi and Page, 1998):

ρ_Neyman = √{π_E(1 − π_E)} / [√{π_E(1 − π_E)} + √{π_C(1 − π_C)}],  (6)

and the expected number of nonresponders is minimized using (Rosenberger et al., 2001):

ρ_RSIHR = √π_E / (√π_E + √π_C).  (7)

Because the response probabilities are unknown, these optimal allocations must be approximated using sample data from an adaptive sequential design. Rosenberger et al. (2001) proposed estimators for (6) and (7) that replace π_E and π_C with the current sample proportions and achieve the optimal allocations in the limit. They suggest randomizing the ith participant to the treatment arm using the optimal allocation estimator based on the data from the first i − 1 participants. Instead, we generalize these optimal allocation ratios to a Bayesian RAR group sequential setting as described in Section 3.2.4.
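A minimal sketch of the two allocation targets in (6) and (7), written in Python for illustration with hypothetical function names:

```python
import math

def neyman_allocation(pi_E, pi_C):
    """Equation (6): minimizes the variance of the difference in
    sample proportions for a fixed total sample size."""
    sE = math.sqrt(pi_E * (1.0 - pi_E))
    sC = math.sqrt(pi_C * (1.0 - pi_C))
    return sE / (sE + sC)

def rsihr_allocation(pi_E, pi_C):
    """Equation (7): minimizes the expected number of nonresponders
    for a fixed variance of the test statistic."""
    rE = math.sqrt(pi_E)
    return rE / (rE + math.sqrt(pi_C))
```

The drawback noted above is easy to verify: with π_E = 0.9 and π_C = 0.5 (so π_E + π_C > 1), `neyman_allocation` targets less than half of participants to the better-performing arm.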

Efficient adaptations
The final two adaptive modifications that we consider desirably lower the variance of the randomization procedure, which is inversely related to the average power of a randomization procedure for a given allocation proportion (Hu and Rosenberger, 2003). We first consider the DBCD proposed by Hu and Zhang (2004), which tends to the targeted allocation in the limit and has a smaller variance than the RPW rule and the adaptive randomized design. This procedure uses the following allocation function, g, to precisely target any desired allocation proportion:

g(x, ρ) = ρ(ρ/x)^γ / [ρ(ρ/x)^γ + (1 − ρ){(1 − ρ)/(1 − x)}^γ], 0 < x < 1,

where x is the realized allocation proportion, ρ is the targeted allocation proportion, g(0, ρ) = 1, g(1, ρ) = 0, and γ ≥ 0 is a constant chosen to control the trade-off between allocation randomness and the asymptotic variance of the design (Hu and Zhang, 2004). We set γ = 2 throughout our simulation study. Implementing this design in a fully sequential fashion, Hu and Zhang (2004) suggest randomizing the ith participant to the treatment arm with probability g(x_{i−1}, ρ(π̂_{E,i−1}, π̂_{C,i−1})), where x_{i−1} is the proportion of participants receiving the treatment after i − 1 allocations, ρ(π_E, π_C) is the targeted allocation for the treatment arm, and π̂_{a,i−1} is an estimate of π_a based on data from the first i − 1 participants.
We next consider the ERADE proposed by Hu et al. (2009), which attains the Cramér-Rao lower bounds on the allocation variances for any allocation proportions. After using a restricted randomization procedure to allocate an initial m_0 participants to treatment or control, ERADE allocates the ith participant to the treatment arm with the following probability:

η ρ̂_{i−1} if x_{i−1} > ρ̂_{i−1};  ρ̂_{i−1} if x_{i−1} = ρ̂_{i−1};  1 − η(1 − ρ̂_{i−1}) if x_{i−1} < ρ̂_{i−1},

where x_{i−1} is the proportion of participants receiving the treatment after i − 1 allocations, ρ̂_{i−1} is the estimated target allocation, and 0 ≤ η ≤ 1 is a constant reflecting the degree of randomization. Per Hu et al. (2009), we use η = 2/3 throughout our simulation study. We generalize both DBCD and ERADE to a Bayesian RAR group sequential setting and use them to target Neyman and RSIHR allocation as described in Section 3.2.4.
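Under our reading of the two rules, they can be sketched as follows (Python for illustration; `gamma` and `eta` correspond to the DBCD and ERADE constants set to 2 and 2/3 in our simulations):

```python
def dbcd_prob(x, rho, gamma=2.0):
    """Hu-Zhang allocation function g(x, rho); gamma trades off allocation
    randomness against the asymptotic variance of the design."""
    if x <= 0.0:
        return 1.0  # boundary condition g(0, rho) = 1
    if x >= 1.0:
        return 0.0  # boundary condition g(1, rho) = 0
    a = rho * (rho / x) ** gamma
    b = (1.0 - rho) * ((1.0 - rho) / (1.0 - x)) ** gamma
    return a / (a + b)

def erade_prob(x, rho, eta=2/3):
    """ERADE rule: bias the next assignment toward the target rho
    depending on whether the realized proportion x over- or under-shoots."""
    if x > rho:
        return eta * rho
    if x < rho:
        return 1.0 - eta * (1.0 - rho)
    return rho
```

Both functions push the next assignment toward the target: when the realized proportion exceeds ρ, the assignment probability drops below ρ, and vice versa.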

3.2.4 Generalizing to a Bayesian RAR group sequential setting

Neyman allocation, RSIHR allocation, DBCD, and ERADE were originally proposed for fully sequential designs. We generalize these adaptive modifications to a Bayesian RAR group sequential setting. We randomize the first group of participants using ρ_1 = 0.5 and adapt ρ_k for k = 2, …, K using posterior mean estimates of the optimal allocations in (6) and (7) that arise as

ρ_k = E{ρ(π_E, π_C) | X_{(k−1)m}},

where m is the sequential group size and X_{(k−1)m} denotes the data available at the (k − 1)th interim analysis. These posterior means are estimated using Gibbs sampling with algorithmic details provided in Web Appendix C.

Randomization method
The weighted coin randomization method is generally used to allocate participants among treatments in Bayesian RAR trials. This method assumes φ_{i,k} = ρ_k for all i = 1, …, m, where φ_{i,k} is the conditional probability that the ith participant in the kth group is allocated to the treatment. We instead use an alternative randomization method proposed by Zhao (2015) called the mass weighted urn design (MWUD). Proper et al. (2021) have previously shown that, in conjunction with the logistic regression probability model, this method substantially reduces the probability of a treatment imbalance in favor of the inferior arm and more precisely targets the desired allocation throughout the trial. The MWUD uses an urn randomization scheme with one ball per treatment in the urn. The randomization schedule is generated by consecutively drawing one ball from the urn with replacement, where the probability of selecting a ball is proportional to its mass. The initial masses of the balls are ωρ_k and ω(1 − ρ_k) for the treatment and control, respectively. After every selection, the chosen ball loses one unit of mass, which is redistributed in proportion to the target allocations. This approach can be implemented using the following equation:

φ_{i,k} = max(0, min[1, {ωρ_k − n_{i−1,E} + (i − 1)ρ_k}/ω]),

where n_{i−1,E} is the number of participants on the treatment prior to the ith allocation in group k, and ω is a treatment imbalance parameter that restricts how far the realized allocation is permitted to deviate from the targeted allocation throughout the trial (Zhao, 2015). For example, when the number of participants on the treatment exceeds the targeted number by a sufficient margin, the next participant is assigned to the control with probability 1. Similar to Proper et al. (2021), we use ω = 3 throughout our simulation study.
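A hedged sketch of the MWUD allocation probability, assuming the mass bookkeeping described above (the treatment ball starts at mass ωρ, loses one unit when drawn, and gains ρ per draw from the redistribution); written in Python with illustrative names and the imbalance parameter defaulting to 3:

```python
def mwud_prob(rho, n_E_prev, i, omega=3.0):
    """MWUD: probability that the i-th participant in the current group is
    assigned to the treatment arm. n_E_prev counts treatment assignments
    among the first i-1 allocations. The treatment ball's current mass is
    omega*rho - n_E_prev + (i-1)*rho, and the assignment probability is its
    share of the total mass omega, clamped to [0, 1]."""
    mass_E = omega * rho - n_E_prev + (i - 1) * rho
    return max(0.0, min(1.0, mass_E / omega))
```

When the treatment arm is sufficiently over-represented the clamp forces the next assignment to control with probability 1, which is the deterministic correction described in the text.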

Interim monitoring boundaries
We perform interim monitoring as follows. For the two-sided setting, we compute p_> using all available outcome data at each interim analysis. When p_> ≥ γ_{u,k} or p_> ≤ γ_{l,k}, the trial is stopped early for efficacy or harm, respectively. For the one-sided setting, the trial is stopped early for efficacy when p_> ≥ γ_{u,k} or for futility when p_> ≤ γ_{l,k}. In either setting, when γ_{l,k} < p_> < γ_{u,k}, no decision is made and the trial continues. We consider both a two-sided hypothesis test setting with symmetric boundaries where γ_{l,k} = 1 − γ_{u,k} and a one-sided hypothesis test setting with asymmetric boundaries where γ_{l,k} ≠ 1 − γ_{u,k}. Similar to Shi and Yin (2019), we use computer simulation to find the maximum sample sizes and boundaries {γ_{l,k}, γ_{u,k}}, k = 1, …, K, that control the frequentist properties of the Bayesian RAR group sequential designs (i.e., maintain the type I error rate and power at the desired levels). We consider setting γ_{u,k} = γ_u for all k = 1, …, K to implement flat boundaries throughout the trial that are similar to Pocock. We also consider setting γ_{u,k} = Φ(√(K/k) ⋅ c), where Φ is the standard normal cumulative distribution function and c is an arbitrary constant. Because this quantity becomes smaller with increasing k, these boundaries are similar to O'Brien-Fleming (OBF) and become more aggressive as the trial proceeds. A description of how to find the maximum sample sizes and posterior probability stopping boundaries for a given design is provided in Web Appendix D. Example code executing this numerical procedure is provided in the Supporting Information.
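The two boundary shapes can be sketched as follows (Python for illustration; we assume the OBF-like efficacy thresholds take the form Φ(√(K/k)·c), consistent with thresholds that start near 1 and relax as the trial proceeds):

```python
from math import sqrt
from statistics import NormalDist

def pocock_like(gamma_u, K):
    """Flat efficacy thresholds: gamma_{u,k} = gamma_u for every analysis."""
    return [gamma_u] * K

def obf_like(c, K):
    """OBF-like efficacy thresholds gamma_{u,k} = Phi(sqrt(K/k) * c),
    which decrease in k, so early stopping is harder at early analyses."""
    Phi = NormalDist().cdf
    return [Phi(sqrt(K / k) * c) for k in range(1, K + 1)]
```

In practice the constants `gamma_u` and `c` (and the maximum sample size) would be calibrated by simulation to hit the target type I error rate and power, as described in Web Appendix D.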

Design considerations for the ARREST and ACCESS trials
Motivated by the ARREST and ACCESS trials, we performed a simulation study with 10,000 simulated trials per scenario to find the optimal two-arm group sequential design with respect to Ψ in (2). For each simulated trial, we generated a data set consisting of two potential outcomes for each individual in the potential study sample: one assuming that they received the treatment and one assuming that they received the control. For each trial participant, we set their observed outcome equal to the potential outcome corresponding to their randomization assignment. We then computed the expected number of failures among the individuals within the potential study sample but not the trial itself using the trial conclusion, as in (1). The same data sets were used to evaluate each design in each scenario.
To reflect the ARREST trial, we considered a hypothesized null scenario of π_E = π_C = 12% and a hypothesized alternative scenario of π_C = 12% and π_E = 37%. To reflect the ACCESS trial, we considered a hypothesized null scenario of π_E = π_C = 50% and a hypothesized alternative scenario of π_C = 50% and π_E = 65%. For completeness, we also considered contexts with hypothesized null scenarios of π_E = π_C = 5% and π_E = π_C = 80%, with corresponding hypothesized alternative scenarios of π_C = 5% and π_E = 20%, and π_C = 80% and π_E = 95%, respectively.
For each treatment effect configuration, we considered two design sets, D_sym and D_asym, composed of 48 different Bayesian RAR group sequential designs using symmetric or asymmetric posterior probability stopping boundaries, respectively. Each RAR design used:
• the logistic regression probability model to estimate p_>;
• one of the eight adaptive modifications in Section 3.2 to obtain randomization probabilities;
• the MWUD to limit deviations from the target allocation;
• K = 3, 5, or 10 interim analyses; and
• one of the two posterior probability stopping boundary shapes in Section 3.4.
The eight adaptive modifications were chosen due to their familiarity in the literature or desirable statistical properties, though designs using other adaptive sampling methods could have been considered (see, for example, Hu and Rosenberger, 2006). More flexible group sequential procedures, such as the Lan-DeMets alpha spending function, are available and common in practice (Lan and DeMets, 1983). However, we implement Pocock-like and OBF-like stopping boundaries due to their simplicity and widespread popularity. They also represent the two extremes typically encountered in practice: investigators tend not to use boundaries more aggressive than Pocock, or less aggressive than OBF. Notably, we consider a design with Pocock-like boundaries and K = 10 interim looks, which reflects intensive monitoring with a substantial sample size inflation, to attain an upper bound for N_max and a standard against which to assess the performance of more feasible designs. For symmetric boundaries, designs were calibrated to maintain a two-sided type I error rate, defined as the proportion of simulated trials where p_> ≥ γ_{u,k} or p_> ≤ 1 − γ_{u,k} for some k under the hypothesized null scenario, of α = 0.05, and a power, defined analogously under the hypothesized alternative scenario, of 1 − β = 0.90. For asymmetric boundaries, designs were calibrated to maintain a one-sided type I error rate, defined as the proportion of simulated trials where p_> ≥ γ_{u,k′} at some analysis k′ and p_> > γ_{l,k} for all k = 1, …, k′ − 1 under the hypothesized null scenario, of α = 0.025, and a power, defined analogously under the hypothesized alternative scenario, of 1 − β = 0.90.

FIGURE 1 Maximum (N_Max) and average (N̄) sample sizes under the targeted alternative for the ARREST and ACCESS trials with K = 5. The terms Pocock-like and OBF-like, respectively, refer to group sequential designs using symmetric Pocock-like and OBF-like efficacy and harm boundaries. A dashed line denotes the sample size required for a fixed sample design with 1:1 allocation. Unconditional Exact denotes the frequentist group sequential design using 1:1 allocation. This figure appears in color in the electronic version of this article.

We compared each Bayesian RAR design to analogous frequentist designs using the MWUD to target a 1:1 allocation, increasing the sizes of D_sym and D_asym to 54. For the frequentist designs, we used the gsDesign R package (v3.1.1) (Anderson, 2020) to find the maximum sample sizes and stopping boundaries required to achieve the desired error rates. The accrued data at each interim analysis were analyzed using an unconditional exact test via the uncondExact2x2 function in the exact2x2 R package (version 1.6.5) (Hirji, 2006; Fay and Hunsberger, 2020). Additional details pertaining to the frequentist designs are provided in Web Appendix E. We ran all simulations with R version 4.0.2, and the software for reproducing our simulation study is available in the Supporting Information. Figure 1 presents the maximum (N_Max) and average (N̄) trial sample sizes under the targeted alternative for designs with K = 5 in D_sym for the ARREST and ACCESS trials. The size of the potential study sample was defined by the restricted Thompson sampling design with K = 10 and Pocock-like stopping boundaries for both ARREST (N_max = 179) and ACCESS (N_max = 702). In the ARREST context, the Bayesian RAR designs generally had larger average sample sizes than the frequentist design and required more participants to achieve 90% power.
Conversely, in the ACCESS context, the Bayesian RAR designs using ERADE or the DBCD to target the optimal allocation ratio yielded sample size distributions as favorable as, or more favorable than, that of the frequentist design.

Figure 2 displays the absolute difference between the expected number of trial failures under the group sequential designs with five interim analyses and under a fixed sample design using 1:1 allocation. The restricted Thompson sampling design engendered the greatest number of trial failures at the targeted alternative and when the observed treatment response rate was less than hypothesized. Although the frequentist design was consistently one of the top performers at the targeted alternative, at least one Bayesian RAR design matched or yielded fewer trial failures than the frequentist design for both ARREST and ACCESS. Differences between designs became negligible when the observed treatment response rate greatly exceeded the hypothesized value. Figure 3 presents an analogous comparison for our proposed design comparison metric, computed using the maximum potential study sample size appropriate to each trial. In contrast to the average number of trial failures, the frequentist design generally performed the worst with respect to the proposed metric, whereas the restricted Thompson sampling design performed the best.

F I G U R E 2
Absolute difference in the average number of trial failures relative to a fixed sample design with 1:1 allocation for various observed response rates in the treatment arm. The vertical red lines denote the hypothesized null and alternative response rates for the ARREST and ACCESS trials. These differences correspond to designs using five interim analyses and symmetric stopping boundaries. Unconditional Exact denotes the frequentist group sequential design using 1:1 allocation. This figure appears in color in the electronic version of this article, and any mention of color refers to that version.

For designs implementing Pocock-like stopping boundaries in the ARREST context, the expected reduction in failures among potential participants relative to a fixed sample design was 7.9 for the restricted Thompson sampling design and 5.1 for the frequentist design at the targeted alternative, indicating that RAR would prevent an additional 2.8 (35%) failures in the potential study sample on average. When the treatment response rate was smaller than hypothesized at 27%, the restricted Thompson sampling design prevented an additional 3.7 (70%) failures, on average, relative to 1:1 allocation with group sequential monitoring. When the treatment response rate was larger than hypothesized at 47%, the expected marginal reduction was smaller, at 1.7 (14%) additional failures. The potential patient benefit of using RAR over fixed 1:1 allocation thus appears greatest when the observed treatment effect is slightly smaller than hypothesized and diminishes when the observed treatment effect is greater than hypothesized. Under our proposed decision-theoretic framework from Section 2, the restricted Thompson sampling design with its tuning parameter set to 10 and Pocock-like stopping boundaries was the optimal design among the candidate designs under consideration for both ARREST and ACCESS.
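The metric underlying these comparisons, expected failures among all individuals who would participate in the trial if enrollment were open when they became eligible, can be sketched in a few lines. The function below is an illustrative simplification under our own naming conventions, not the paper's exact estimator: trial participants fail at their assigned arm's true rate, and the remainder of the fixed-size potential study sample is assumed to receive whichever arm the trial recommends (failure rate taken as one minus the response rate).

```python
def expected_failures_potential_sample(n_trial_t, n_trial_c, recommend_t,
                                       p_t, p_c, n_max):
    """Expected failures among a potential study sample of fixed size
    n_max, given realized trial arm sizes, true response rates, and
    the trial's final treatment recommendation.

    n_trial_t, n_trial_c : participants randomized to treatment/control
    recommend_t          : True if the trial recommends the treatment arm
    p_t, p_c             : true response (success) rates on each arm
    n_max                : size of the potential study sample
    """
    n_trial = n_trial_t + n_trial_c
    # Failures accrued inside the trial itself.
    in_trial = n_trial_t * (1 - p_t) + n_trial_c * (1 - p_c)
    # Working assumption: everyone outside the trial receives the
    # recommended arm and fails at that arm's true rate.
    p_rec = p_t if recommend_t else p_c
    post_trial = (n_max - n_trial) * (1 - p_rec)
    return in_trial + post_trial
```

In practice this quantity would be averaged over simulated trial realizations of a given design, which is how designs with equal error rates but different stopping and allocation behavior can be ranked.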
Although the differences in the proposed metric between the Bayesian RAR designs and the frequentist design are modest, these reductions are on par with those achieved by group sequential monitoring itself. This is evidenced in Table 1, which reports the value of the metric for each candidate design for the ARREST and ACCESS trials. Consider the average number of failures in the potential study sample prevented by using the restricted Thompson sampling design (tuning parameter 10) with Pocock-like stopping boundaries instead of the fixed sample design in the ARREST context: 130 − 120.2 = 9.8. Because the corresponding frequentist design yields an expected 124.0 failures, we can infer that 6.0 of the 9.8 prevented failures, or 61.2%, are attributable to interim monitoring, whereas 3.8, or 38.8%, are attributable to the Bayesian RAR procedure itself. Nearly identical percent reductions were observed for the ACCESS trial.
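As a sanity check, the decomposition above is simple arithmetic on the values quoted from Table 1 (ARREST context):

```python
# Expected failures among the potential study sample, from the text:
fixed = 130.0   # fixed sample design with 1:1 allocation
freq = 124.0    # frequentist group sequential design, 1:1 allocation
brar = 120.2    # restricted Thompson sampling + Pocock-like boundaries

total_prevented = fixed - brar          # 9.8 failures prevented overall
from_monitoring = fixed - freq          # 6.0 attributable to interim looks
from_rar = freq - brar                  # 3.8 attributable to RAR itself
share_monitoring = from_monitoring / total_prevented   # ~0.612
share_rar = from_rar / total_prevented                 # ~0.388
```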
F I G U R E 3
Absolute difference in expected failures among potential participants relative to a fixed sample design with 1:1 allocation for various observed response rates in the treatment arm. The vertical red lines denote the hypothesized null and alternative response rates for the ARREST and ACCESS trials. These differences correspond to designs using five interim analyses and symmetric stopping boundaries. Unconditional Exact denotes the frequentist group sequential design using 1:1 allocation. This figure appears in color in the electronic version of this article, and any mention of color refers to that version.

Similar figures and tables for designs with three and ten interim analyses, for the low and high null response rate scenarios, and for designs using asymmetric stopping boundaries are provided in Web Appendices F-I. The findings discussed herein generally held across all scenarios considered; however, differences between designs with respect to trial failures and the proposed metric were more apparent for designs targeting low or high null response rates. The restricted Thompson sampling design with its tuning parameter set to 10 and Pocock-like stopping boundaries was the optimal design among the candidate design sets for each treatment effect configuration.

DISCUSSION
This paper proposes a design comparison metric that investigators may use to select a group sequential design that minimizes harm to potential participants, which may be appealing in trials where the primary outcome is mortality or some other outcome with severe implications for quality of life. This metric is simple to implement, may be applied to any set of designs and extended to a multiarm context, and provides a means to reasonably evaluate the potential patient benefit of RAR against classical frequentist group sequential designs. Our simulations indicate that the restricted Thompson sampling Bayesian RAR design tends to perform the best with respect to our metric across a variety of scenarios, and that Bayesian RAR offers modest reductions in the number of failures in the potential study sample relative to group sequential monitoring alone. We also found that Bayesian RAR group sequential designs exhibit greater marginal gains relative to group sequential designs using 1:1 allocation when the treatment effect is slightly smaller than hypothesized. Conversely, these gains diminish when the treatment effect is much larger than hypothesized, likely because both designs will stop very early with high probability, which precludes adaptations to the allocation ratio. Whether these gains are worthwhile needs to be assessed within the context of each trial based on the severity of the primary outcome, the plausibility of observing various effect sizes, and the increased complexity that comes with Bayesian RAR implementation.

This research has several limitations. First, owing to the motivating context for this work, we focused only on two-arm trials with a binary primary outcome and a select set of Bayesian RAR procedures. Our conclusions regarding the potential patient gains arising from RAR with respect to our proposed metric may vary in different design settings or when alternative RAR procedures are considered.
Next, we did not account for the presence of time trends, including patient drift, which have been shown to inflate the type I error rate in response-adaptive designs (Thall et al., 2016). We also made the working assumption that individuals in the potential study sample who did not participate in the trial would have received the treatment recommended by the trial. While this is a reasonable assumption for the motivating ACCESS and ARREST trials, it may deviate from practice when the recommended treatment cannot be readily applied to the entire target population because of regulatory requirements and the time required for dissemination and uptake of trial results. Yet this assumption provides a reasonable way to reward a design that stops early to recommend a treatment that is, in fact, efficacious, and it enables our proposed metric to facilitate a fairer design comparison than existing metrics based on the realized trial sample size, which often fail to (1) compare designs with equal type I and II error rates or (2) assess the impact of the design with respect to relevant, equal-sized populations. We also did not consider multiarm trials, in which Bayesian RAR has been shown to increase efficiency relative to balanced randomization. An interesting area for future work would be to study our metric in the context of platform trials or multiarm multistage (MAMS) designs (see Lin and Bunn, 2017; Watson and Trippa, 2014). Finally, although we declare the optimal group sequential design to be the one that minimizes the average number of failures in the potential study sample among the designs considered, we did not fully optimize this criterion. Future work, including the evaluation of frequentist group sequential designs whose characteristics have been optimized so that the expected number of failures in the potential study sample is minimal, is needed to identify the design that fully minimizes the proposed metric.

A C K N O W L E D G M E N T S
The authors thank the three anonymous referees, the associate editor, and the coeditor for their helpful comments that substantially improved the quality of this paper. The authors also thank Medtronic Inc. for their support in the form of a Biostatistics Faculty Fellowship. This work was supported by the NIH/NCATS under Grant UL1TR002494 and by the NHLBI under award number T32HL129956.

D ATA AVA I L A B I L I T Y S TAT E M E N T
The data that support the findings in this paper are available from the corresponding author upon reasonable request.