Response‐adaptive randomization for multi‐arm clinical trials using the forward looking Gittins index rule

The Gittins index provides a well established, computationally attractive, optimal solution to a class of resource allocation problems known collectively as the multi‐arm bandit problem. Its development was originally motivated by the problem of optimal patient allocation in multi‐arm clinical trials. However, it has never been used in practice, possibly for the following reasons: (1) it is fully sequential, i.e., the endpoint must be observable soon after treating a patient, reducing the medical settings to which it is applicable; (2) it is completely deterministic and thus removes randomization from the trial, which would naturally protect against various sources of bias. We propose a novel implementation of the Gittins index rule that overcomes these difficulties, trading off a small deviation from optimality for a fully randomized, adaptive group allocation procedure which offers substantial improvements in terms of patient benefit, especially relevant for small populations. We report the operating characteristics of our approach compared to existing methods of adaptive randomization using a recently published trial as motivation.


Introduction
Consider a clinical trial to test the effectiveness of several treatments on a group of patients. A rational observer, who is unfamiliar with established medical convention, might well suppose that patients would be allocated to treatments with the aim of optimizing their collective health. Indeed, using this idea as the initial motivation, Gittins and Jones (1979) first proposed an optimal, deterministic rule to perform such a task within a multi-arm clinical trial, which is termed the Gittins index. Since then, the Gittins index and its extensions have been extensively used to address resource allocation problems (collectively known as "multi-arm bandit" models) in a variety of scientific fields. Applications include queuing theory and optimal scheduling (Aalto et al., 2009), computer-communication networks (Cechi and Jacko, 2013), and internet marketing (Hauser et al., 2009), among others.
A large body of literature has accumulated on the use of bandit models to address ethical issues in trial design (see e.g., Berry and Fristedt, 1985; Berry and Eick, 1995; Press, 2009). Ironically, bandits, and more specifically the Gittins index, have never been used in clinical practice (to the best of the authors' knowledge) due to certain realities of medical research. These are discussed at length in a recent review by Villar et al. (2015), the main findings of which are now summarized.
The most immediate limitation is practical: in order to apply the Gittins index, each patient's outcome needs to be observed before the next patient is assigned. This rules out the vast majority of medical conditions and diseases where the outcome of treatment is observed with non-negligible delay. Another major barrier, common to all optimal decision-analytical designs (Cheng and Berry, 2007), is that the deterministic nature of the Gittins index runs contrary to the predominant mode of clinical trial design and analysis over the last 60 years. Under this framework patients are randomized to different treatment arms with equal probability. This has several desirable consequences. First, it guarantees (asymptotically) that patient groups will be balanced with respect to all characteristics (known or unknown), thus reducing the potential for confounding bias. Second, it provides a perfect vehicle to facilitate a double-blind experiment, making it harder for the trial organizers to induce allocation bias (see e.g., Chow and Chang, 2008). Third, fixed and equal randomization maximizes the learning about all treatments, i.e., it maximizes power to detect a significant difference between any two treatments in the trial by assigning similar numbers of patients to both. Having a high power is usually the primary aim of a trial and relegates the benefit to patients (within the trial) into second place. However, there are signs that this established view, as summarized in the Belmont report (Beauchamp, 2008), is changing. It is now not as common in rare disease settings, for example in pediatric oncology (Wang and Arnold, 2002).
The reduction in statistical power caused by adaptive allocation is not unique to these index rules, rather it affects all adaptive allocation procedures to a greater or lesser extent (Hu and Rosenberger, 2003). It is simply not possible to simultaneously maximize trial power while maximizing patient outcomes by skewing treatment allocation. That is why an ideal adaptive design would seek to assign more patients to the better treatment with the minimum sacrifice of inferential power.
In a multi-arm setting in which several treatments are assessed simultaneously against a shared control group, adaptive randomization does not face the power-patient benefit trade-off as separate two-arm trials do. The power of an adaptive multi-arm design can be increased by protecting the allocation to the control treatment (Trippa et al., 2012). By incorporating this feature, as shown in Villar et al. (2015), one can identify adaptive trial designs that score highly in both statistical power and patient benefit. Moreover, by testing many new promising treatments at the same time, multi-arm trials increase the probability of finding a successful new treatment and speed up the process of doing so.
This article addresses the remaining two limitations of the Gittins index, i.e., its determinism and its fully sequential nature. We present an implementation of the Gittins index that introduces randomization, trading off a small amount of its optimality to gain applicability to real clinical trials. This is achieved by a modified algorithm that is probabilistic and is applied to blocks of patients rather than individuals, as suggested by Rosenberger and Lachin (1993).
One of the first works on randomized bandits was Yang and Zhu (2002). The authors consider a multi-arm problem with covariates and introduce a myopic randomized solution that is asymptotically consistent. Instead, we consider a problem without covariates and concentrate on introducing randomization to the Gittins index, by taking into account future sequences of allocations under Gittins' rule. Our simple, heuristic approach thereby aims to achieve a near-optimal mean total rewards criterion in a computationally feasible way.
The closest idea we know of in the literature is in Cheng and Berry (2007), where the introduction of randomization into an optimal design is considered. Treatment decisions are made using a decision-analytic approach whose goal is to maximize overall successful patient treatment over the patient population under the constraint that each arm must have at least probability r of being assigned to patients in the trial. Our algorithm does not impose any limitation on the allocation probabilities. This allows for arms to be dropped or promoted within a trial if there is enough evidence.
In Section 2, we introduce the Gittins index and demonstrate how it can be applied in a clinical trial setting with a binary outcome. We introduce our probabilistic implementation of the Gittins index, which we call the forward looking Gittins index (FLGI), in Section 3. In Section 4, we compare our approach to alternative procedures by performing simulations in the context of a recently published clinical trial, NeoSphere (Gianni et al., 2012). We conclude with a discussion of our results and point to future research in Section 5.

Methods: the Gittins Index
Consider a clinical trial to test the effectiveness of $K$ experimental treatments against a control treatment on a fixed sample of $T$ patients. When a patient $t$ ($t = 1, \ldots, T$) is allocated to treatment $k$ ($k = 0, \ldots, K$), where $k = 0$ denotes the control, a binary outcome $Y_{k,t}$ is observed with $\Pr(Y_{k,t} = 1) = p_k$, the true unknown response probability for treatment $k$. Assume that $Y_{k,t} = 1$ if the treatment is successful and 0 otherwise. Patients enter the trial one-by-one and the outcome for patient $t$ is observed before patient $t + 1$ appears.
Let each parameter $p_k$ be assigned a $\mathrm{Beta}(s_{k,0}, f_{k,0})$ prior density at the start of the trial, where $s_{k,0}$ and $f_{k,0}$ denote prior beliefs about the relative chances of success and failure of treatment $k$, respectively. Given the conjugacy of the Beta prior and the Bernoulli-distributed outcome, these priors are converted into Beta posteriors for $p_k$ via Bayes' theorem as patients enter the trial, are assigned to a treatment arm, and subsequently experience a success or a failure. Let $X_{k,t} = (s_{k,0} + S_{k,t}, f_{k,0} + F_{k,t})$ be the two-dimensional state vector of available information on treatment $k$ at time $t$, where the random vector $(S_{k,t}, F_{k,t})$ represents the number of successful and unsuccessful outcomes for arm $k$ up to patient $t$. The posterior for $p_k$ after having treated patient $t$ and observing $s_{k,t}$ successes and $f_{k,t}$ failures is $f(p_k \mid x_{k,t}) \sim \mathrm{Beta}(s_{k,0} + s_{k,t}, f_{k,0} + f_{k,t})$, with posterior mean

$$E[p_k \mid x_{k,t}] = \frac{s_{k,t} + s_{k,0}}{s_{k,t} + f_{k,t} + s_{k,0} + f_{k,0}}.$$

Finally, let $a_{k,t}$ be the binary indicator variable denoting whether patient $t + 1$ is assigned to treatment $k$ or not. The multi-armed bandit optimization problem is to find an allocation rule $\pi$ that attains

$$V^*_D(\tilde{x}_0) = \max_{\pi \in \Pi} E^{\pi}\left[ \sum_{t=0}^{T-1} \sum_{k=0}^{K} d^t \, \frac{s_{k,0} + S_{k,t}}{s_{k,0} + f_{k,0} + S_{k,t} + F_{k,t}} \, a_{k,t} \,\Big|\, \tilde{x}_0 \right], \qquad (1)$$

where $\tilde{x}_0 = (x_{k,0})_{k=0}^{K}$ is the initial joint state with all the prior parameters, $E^{\pi}[\cdot]$ denotes expectation under allocation rule $\pi$, and $d$ is a discount factor (i.e., $0 \le d < 1$) introduced for reasons of tractability so that a trial of infinite size ($T = \infty$) can be assumed. Thus, $V^*_D(\tilde{x}_0)$ is the optimal expected total-discounted value function conditional on $\tilde{x}_0$ over $\Pi$, the family of admissible allocation rules (i.e., those for which $\sum_{k=0}^{K} a_{k,t} = 1$ for all $t$). Put simply, (1) is the maximum average (discounted) number of patient responses attainable given the initial information on the available treatments before the start of the trial.
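The conjugate update described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the class name `ArmState` and its methods are hypothetical, and the outcome sequence is made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArmState:
    """Information state of one arm: (prior + observed successes, prior + observed failures)."""
    s: float
    f: float

    def update(self, success: bool) -> "ArmState":
        # Beta-Bernoulli conjugacy: a success increments s, a failure increments f.
        return ArmState(self.s + 1, self.f) if success else ArmState(self.s, self.f + 1)

    def posterior_mean(self) -> float:
        # E[p_k | x_{k,t}] = s / (s + f)
        return self.s / (self.s + self.f)


# Start from a Beta(1, 1) prior and observe success, failure, success (illustrative data).
arm = ArmState(1, 1)
for outcome in (1, 0, 1):
    arm = arm.update(bool(outcome))

print(arm, arm.posterior_mean())  # ArmState(s=3, f=2) 0.6
```

Keeping the state immutable makes it easy to enumerate hypothetical future histories without disturbing the data observed so far.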
Since rewards are geometrically discounted, i.e., patient $t$'s success yields a reward of $d^t$, the choice of $d$ plays a fundamental role. To see this, interpret $d$ as the probability of the trial continuing after a given patient, so that $(1 - d^t)$ is the probability of the trial having terminated by patient $t$. If $d$ is small (say 0.5), then optimization is done for a small expected number of patients (say 15, since $(1 - 0.5)^{15} < 10^{-4}$). An index for the undiscounted finite horizon problem can be considered, though it introduces further complexities (Villar et al., 2015).
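To make the role of the discount factor concrete, the short sketch below counts how many patients it takes before the discount weight $d^t$ falls below a threshold. The helper name and the threshold of $10^{-4}$ (taken from the example in the text) are illustrative assumptions.

```python
def patients_until_negligible(d: float, eps: float = 1e-4) -> int:
    """Smallest n such that d**n < eps (hypothetical helper for illustration)."""
    n, weight = 0, 1.0
    while weight >= eps:
        weight *= d
        n += 1
    return n


print(0.5 ** 15 < 1e-4)                 # True, as stated in the text
print(patients_until_negligible(0.5))   # 14
print(patients_until_negligible(0.99))  # several hundred patients
```

A discount factor close to 1 therefore corresponds to optimizing over a much longer effective horizon than a small one.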
The solution to (1) can be found via dynamic programming, using a backward induction algorithm. This becomes computationally infeasible even for relatively small values of $T$ and $K$ and is extremely difficult to implement in practice, since it specifies an arm to use for each possible accumulated trial history. An elegant and computationally tractable solution to (1) when considered for $T = \infty$ was given by Gittins and Jones (1979). They showed that the optimal rule obtained by backward induction is equivalent to the rule of allocating patient $t + 1$ to the arm with the highest Gittins index at time $t$. For treatment $k$, this index is denoted $G(x_{k,t})$ (equation (2)). Gittins and Jones (1979) proved that one can attach an index $G(x_{k,t})$ to each treatment $k$, which depends only on its current information state, such that the optimal action for patient $t + 1$ is allocation to the treatment whose current index is greatest. The Gittins index is calculated by solving the problem of allocating patients optimally between treatment $k$ and a known treatment which yields a constant reward. Calculations of the indices (2) have been reported in tables as in Gittins et al. (2011). A detailed explanation of how the Gittins index rule is deployed and a table with values are given in Tables 4 and 5 of Web Appendix A.
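For reference, the textbook stopping-time form of the Gittins index can be written in the notation of (1) as follows. This is a sketch of the standard definition and is assumed, not confirmed, to coincide with equation (2); the symbols $\tau$ (a stopping time), $\lambda$ (the constant reward of the known treatment), $V$, and $\mu$ are introduced here for illustration only.

```latex
% Standard stopping-time characterisation (assumed to match equation (2)):
G(x_{k,t}) \;=\; \sup_{\tau \ge 1}
\frac{E\!\left[\sum_{i=0}^{\tau-1} d^{\,i}\,
      \dfrac{s_{k,0}+S_{k,t+i}}{s_{k,0}+f_{k,0}+S_{k,t+i}+F_{k,t+i}}
      \,\Big|\, x_{k,t}\right]}
     {E\!\left[\sum_{i=0}^{\tau-1} d^{\,i} \,\Big|\, x_{k,t}\right]}

% Equivalent calibration view: for an arm in state (s, f) with \mu = s/(s+f),
% G is the value of \lambda at which retiring to the known treatment and
% continuing with the unknown one are equally attractive in
V(s,f;\lambda) \;=\; \max\!\left\{ \frac{\lambda}{1-d},\;
   \mu + d\,\big[\mu\,V(s+1,f;\lambda) + (1-\mu)\,V(s,f+1;\lambda)\big] \right\}
```

The calibration form is the one that matches the verbal description of comparing treatment $k$ with a known treatment yielding a constant reward.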

The Forward-Looking Gittins Index Algorithm
Assume that instead of enrolling $T$ patients one-by-one, patients are enrolled in groups of size $b$ over $J$ stages, so that $J \times b = T$. We wish to specify a rule based on the Gittins index that sequentially randomizes the next $b$ patients among the $K + 1$ treatments at stage $j$ ($j = 1, \ldots, J$) given the data up to block $j - 1$. This translates to determining $\pi_{k,j}$, the probability of allocation to treatment $k$ at stage $j$ ($j = 1, \ldots, J$), which is common to all patients in block $j$, when using the Gittins index and given all data observed up to stage $j - 1$ (and therefore up to patient $(j - 1) \times b$), denoted by $\tilde{x}_{(j-1)b}$. Note that $\tilde{x}_{(j-1)b}$ can be written as a $(K + 1) \times 2$ matrix in which row $k$ represents the parameters of treatment $k$'s current posterior distribution up to patient $(j - 1)b$. This marginal probability is obtained via the following formula:

$$\pi_{k,j} = \frac{1}{b} \sum_{t=(j-1)b+1}^{jb} \left[ \sum_{\tilde{x}_{t-1} \in \Omega_{t-1}} \Pr(a^{GI}_{k,t} = 1 \mid \tilde{x}_{t-1}) \, \Pr(\tilde{X}_{t-1} = \tilde{x}_{t-1} \mid \tilde{x}_{(j-1)b}) \right] \qquad (3)$$

Here $\Omega_{t-1}$ represents the set of all possible values for $\tilde{X}_{t-1}$ given initial data $\tilde{x}_{(j-1)b}$ for every future patient $t$ in $(j - 1)b + 1, \ldots, jb$ under the Gittins index rule (summarized by $a^{GI}_{k,t}$). Each term of the summation within the square brackets of equation (3) represents the joint probability of allocating a future patient $t$ in block $j$ to treatment $k$ and the current information state at patient $t - 1$, given the data up to block $j - 1$. Notice that:
• For the initial stage, i.e., for $j = 1$, $\tilde{x}_{(j-1)b}$ contains only the prior beliefs on the effectiveness of each treatment. When all treatments are assigned identical priors, it is clear that $\pi_{k,1} = 1/(K + 1)$ for all $k$ (see Table 4 in Web Appendix A).
• For $j \ge 2$, the allocation decision for the first patient in block $j$ (i.e., for patient $t = (j - 1)b + 1$) is deterministic if a single treatment has a unique maximum Gittins index given $\tilde{x}_{(j-1)b}$. However, if multiple treatments $k^* \in \{0, \ldots, K\}$ are joint maxima, ties are broken at random among the $k^*$.
• For $j \ge 2$ and for subsequent patients in block $j$ ($t = (j - 1)b + 2, \ldots, jb$), allocation probabilities are determined by averaging over the posterior predictive distribution of future data given $\tilde{x}_{(j-1)b}$. For Bernoulli data and a Beta prior, this is the well-known Beta-Binomial distribution. Again, ties between treatments are broken at random.
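For concreteness, the posterior predictive distribution referred to in the last item can be written out explicitly. The following is the standard Beta-Binomial result for an arm whose current posterior is $\mathrm{Beta}(s, f)$; the symbols $n$ (number of future patients on that arm), $Y$ (their number of successes), and $B(\cdot,\cdot)$ (the Beta function) are introduced here for illustration.

```latex
% Probability of y successes among n future patients on an arm in state (s, f):
\Pr(Y = y \mid s, f) \;=\; \binom{n}{y}\,
   \frac{B(s + y,\; f + n - y)}{B(s,\, f)}, \qquad y = 0, 1, \ldots, n

% For a single future patient (n = 1) this reduces to the posterior mean:
\Pr(Y = 1 \mid s, f) \;=\; \frac{s}{s + f}
```

The one-patient case is the transition probability used when stepping from the information state at patient $t - 1$ to that at patient $t$ within a block.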

Worked Example
To clarify how these probabilities are computed and their computational cost, we present a simple example of a two-arm trial testing a control treatment ($k = 0$) against an experimental treatment ($k = 1$) with a block of size 2 (i.e., $b = 2$ and $K = 2$). Suppose further that in the first block of the trial, having started with Beta(1,1) priors, both patients were allocated to control, yielding one success and one failure, and no patient was allocated to the experimental treatment. Therefore, for the second block the priors for the two treatments are $\tilde{x}_2 = [(2, 2); (1, 1)]$. Figure 1 illustrates, via a probability tree, how the FLGI probabilities for block 2 given the data in block 1 are computed. Given that the experimental treatment has the maximum Gittins index, the allocation decision for the first patient of the second block (i.e., patient 3) is deterministic. It follows that $\Pr(a^{GI}_{0,3} = 1 \mid \tilde{x}_2) = 0$ and $\Pr(a^{GI}_{1,3} = 1 \mid \tilde{x}_2) = 1$. When the second patient of the second block is to be allocated (i.e., patient 4), given that we have allocated the experimental treatment to the first patient, two possible outcomes can occur. If a success occurs, which happens with probability 1/2 = 0.5, then the experimental treatment is allocated again. If a failure occurs, then the control treatment is allocated.

Table 1
Comparison of the mean value of successes under the optimal rule for $K = 2$, $b = 2$ (obtained via dynamic programming), fixed randomization (FR), the FLGI, and the GI (Gittins index) in $10^3$ replicas of a trial of size $T = 30$. The discount factor used for GI and FLGI is $d = 0.7$ (so as to ...

Thus, for patient 4, $\sum_{\tilde{x}_3} \Pr(a^{GI}_{k,4} = 1 \mid \tilde{x}_3) \Pr(\tilde{X}_3 = \tilde{x}_3 \mid \tilde{x}_2)$ reduces to 1/2 for $k = 0, 1$. Using equation (3), we obtain $\pi_{0,2} = (0 + 1/2)/2 = 1/4$ and $\pi_{1,2} = (1 + 1/2)/2 = 3/4$. From this example, it is clear that the computational cost of computing the $\pi_{k,j}$'s, which depends on the joint state for the $K + 1$ arms, i.e., $\tilde{x}_t$ (instead of the one-arm state $x_t$), will grow exponentially as $b$ and $K$ increase. Hence, we use a Monte-Carlo algorithm for this purpose.
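The worked example can be checked with a small exact enumeration. The sketch below is not the authors' Matlab implementation: `gittins_index` approximates the index by a calibration argument (bisection on a constant retirement reward, with the look-ahead truncated at a finite `horizon`), and `flgi_probabilities` walks the probability tree for one block. The function names, the discount factor `d = 0.9`, the truncation depth, and the tie tolerance are all assumptions chosen for illustration.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def gittins_index(s, f, d=0.9, horizon=100, tol=1e-6):
    """Approximate Gittins index of a Beta(s, f) Bernoulli arm (calibration + truncation)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        lam = 0.5 * (lo + hi)
        retire = lam / (1.0 - d)  # value of switching to the known arm forever
        n = s + f + horizon
        # Terminal approximation: freeze the posterior mean at the truncation depth.
        values = [max(lam, (s + i) / n) / (1.0 - d) for i in range(horizon + 1)]
        for depth in range(horizon - 1, 0, -1):
            n = s + f + depth
            new = []
            for i in range(depth + 1):  # i = extra successes observed so far
                mu = (s + i) / n
                cont = mu + d * (mu * values[i + 1] + (1 - mu) * values[i])
                new.append(max(retire, cont))
            values = new
        mu = s / (s + f)
        risky = mu + d * (mu * values[1] + (1 - mu) * values[0])
        if risky > retire:  # unknown arm still preferred, so the index exceeds lam
            lo = lam
        else:
            hi = lam
    return 0.5 * (lo + hi)


def flgi_probabilities(state, b, d=0.9):
    """Exact block allocation probabilities by enumerating all within-block histories."""
    alloc = [0.0] * len(state)
    frontier = {tuple(state): 1.0}  # joint state -> probability of reaching it
    for _ in range(b):
        nxt = {}
        for x, px in frontier.items():
            idx = [gittins_index(s, f, d) for s, f in x]
            best = max(idx)
            winners = [k for k, g in enumerate(idx) if g >= best - 1e-9]
            for k in winners:  # ties are split uniformly
                pk = px / len(winners)
                alloc[k] += pk
                s, f = x[k]
                mu = s / (s + f)  # posterior predictive success probability
                for ds, df, pr in ((1, 0, mu), (0, 1, 1.0 - mu)):
                    y = list(x)
                    y[k] = (s + ds, f + df)
                    nxt[tuple(y)] = nxt.get(tuple(y), 0.0) + pk * pr
        frontier = nxt
    return [a / b for a in alloc]


# Block 2 of the worked example: control at Beta(2, 2), experimental at Beta(1, 1).
print(flgi_probabilities(((2, 2), (1, 1)), b=2))  # [0.25, 0.75]
```

The number of joint states in `frontier` grows quickly with the block size and the number of arms, which is the reason a Monte-Carlo approximation is preferred for realistic designs.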
The optimal allocation for any K, b, and T solves a dynamic programming problem. To assess the FLGI's performance, we consider a two-arm trial of 30 patients recruited in 15 separate blocks (T = 30, J = 15, b = 2, K = 2). The optimal rule in this case allocates one patient per treatment in the first block and both patients to the treatment with the highest posterior mean in the last block. For blocks between 2 and 14, optimal actions are less straightforward to summarize. The FLGI for b = 2 constitutes a heuristic that approximates the optimal policy as shown in Table 1 (See Web Appendix C for the problem formulation).

Simulation Study
We now evaluate the properties of the FLGI procedure by simulation, focusing on its statistical power ($1 - \beta$); type I error rate ($\alpha$); expected proportion of patients assigned to the best treatment ($p^*$); and expected number of patient successes (ENS). We compare the FLGI to the following established group allocation procedures (see Table 2):
(1) Fixed randomization (FR): patients are allocated between the $K$ experimental arms and the control arm with a fixed and equal probability at every interim stage.
(2) Thompson sampling (TS): assuming a uniform prior on $[0, 1]^{K+1}$ for $(p_0, \ldots, p_K)$, following Thompson (1933), we allocate treatment $k$ to patients in block $j$ with a probability proportional to the posterior probability that $p_k$ is the largest response rate given the observed data.
(3) Trippa procedure: the Bayesian adaptive randomization procedure of Trippa et al. (2012), originally proposed for Glioblastoma. Although the original cancer setting determined the use of a time-to-event outcome, the design approach has been transferred to the binary data setting (Wason and Trippa, 2014).
(4) Optimal allocation ratios: following the implementation in Tymofyeyev et al. (2007), we consider two different rules: Neyman allocation (NA), which minimizes total sample sizes for a given power constraint, and RSIHR allocation (RSIHRA), which minimizes expected failures given a power constraint and is named after the authors that first introduced it in a two-armed setting (Rosenberger et al., 2001).
(5) Controlled FLGI (CFLGI): in addition to the rule defined by (3), we shall consider a controlled group allocation rule which, similarly to the Trippa procedure, protects the allocation to the control treatment, so it never goes below $1/(K + 1)$ during the trial.
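As an illustration of procedure (2), the posterior probability that each arm has the largest response rate can be approximated by sampling from the Beta posteriors. The sketch below uses only the Python standard library. The function names, the number of draws, and the interim counts (back-calculated from the reported NeoSphere response rates) are assumptions for illustration, and `protect_control` is one hypothetical way of enforcing a floor on the control allocation, not the CFLGI rule itself.

```python
import random


def thompson_probabilities(states, draws=10_000, seed=0):
    """Monte-Carlo estimate of Pr(p_k is the largest | data) for Beta(s, f) posteriors."""
    rng = random.Random(seed)
    wins = [0] * len(states)
    for _ in range(draws):
        sample = [rng.betavariate(s, f) for s, f in states]
        wins[sample.index(max(sample))] += 1
    return [w / draws for w in wins]


def protect_control(probs, floor):
    """Hypothetical protection step: lift arm 0 to `floor` and rescale the other arms."""
    if probs[0] >= floor:
        return list(probs)
    scale = (1.0 - floor) / (1.0 - probs[0])
    return [floor] + [p * scale for p in probs[1:]]


# Beta(1, 1) priors updated with illustrative counts (successes, failures) per arm.
posteriors = [(1 + 31, 1 + 76), (1 + 49, 1 + 58), (1 + 18, 1 + 89), (1 + 23, 1 + 73)]
ts = thompson_probabilities(posteriors)
print(ts)                          # arm 1 receives almost all of the probability mass
print(protect_control(ts, 1 / 4))  # control lifted to 1/(K + 1) with K = 3
```

Without the protection step the control allocation collapses toward zero once one experimental arm dominates, which is what drives the loss of power discussed later.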
Note that in choosing which rules to compare our procedure against, we wanted to include methods that are used in practice, although fixed randomization is still the most popular allocation rule in use. In all simulations, we used a uniform Beta(1,1) prior for each treatment arm and we computed the allocation probabilities for Thompson sampling, the Trippa procedure, FLGI, and CFLGI using a simple Monte-Carlo approximation based on $10^2$ replicas (Matlab code is available with this article at the Biometrics website).

The NeoSphere Trial
The NeoSphere trial, as reported by Gianni et al. (2012), evaluated the effect of a combination of drugs on pathological complete response in the breast for women with locally advanced inflammatory breast cancer. Out of 417 eligible patients, 107 were randomly assigned to group 0 (trastuzumab plus docetaxel), 107 to group 1 (pertuzumab and trastuzumab plus docetaxel), 107 to group 2 (pertuzumab and trastuzumab), and 96 to group 3 (pertuzumab plus docetaxel). The original trial allocated patients between the four arms with a fixed and equal randomization probability of 1/4. The primary response outcome was available 24 weeks after randomization. At the end of the trial, the response rates observed in groups 0, 1, 2 and 3 were 29.0%, 45.8%, 16.8%, and 24.0%, respectively.
We show the impact of redesigning the study using the FLGI and the alternative patient allocation rules. Two response rate scenarios are initially considered: (i) the global null where $p_i = 0.290$, $i = 0, 1, 2, 3$; (ii) the observed values in NeoSphere, $p_0 = 0.290$, $p_1 = 0.458$, $p_2 = 0.168$, and $p_3 = 0.240$. We first fix the block size to $b = 9$ patients and the number of interim analyses to $J = 46$. This leaves three remaining patients, who are allocated using the probability allocation vector resulting after observing the outcomes of block 46. If we imagine that the trial's duration remains fixed at 2 years, a block size of nine is consistent with being able to observe the outcome after a 2-week delay. We subsequently review the method's performance for larger block sizes, which are consistent with longer delays.
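The blocked design just described (46 blocks of nine patients plus three remaining patients) can be sketched as a small simulation loop in which the allocation probabilities are recomputed only from completed blocks. The function names, the seed, and the number of replicates are illustrative assumptions; fixed randomization is plugged in here because it needs no further machinery, and any other rule with the same signature could be substituted.

```python
import random


def simulate_trial(p, sizes, allocation_rule, rng):
    """Simulate one blocked trial; returns (successes per arm, patients per arm)."""
    arms = len(p)
    succ, n = [0] * arms, [0] * arms
    for size in sizes:
        probs = allocation_rule(succ, n)  # uses data from completed blocks only
        assigned = rng.choices(range(arms), weights=probs, k=size)
        for k in assigned:                # outcomes become available at the end of the block
            n[k] += 1
            succ[k] += rng.random() < p[k]
    return succ, n


def fixed_randomization(succ, n):
    """Equal allocation probabilities regardless of the accumulated data."""
    return [1.0 / len(n)] * len(n)


rng = random.Random(1)
p_obs = [0.290, 0.458, 0.168, 0.240]  # scenario (ii)
sizes = [9] * 46 + [3]                # 46 blocks of 9 patients, then the 3 remaining
replicates = 500
ens = sum(sum(simulate_trial(p_obs, sizes, fixed_randomization, rng)[0])
          for _ in range(replicates)) / replicates
print(sum(sizes), round(ens, 1))      # 417 patients; ENS close to 417 * 0.289
```

Treating the final three patients as a short extra block reproduces the convention of allocating them with the probabilities available after block 46.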

Table 2 (allocation rules; only the following entries are recoverable)
Trippa procedure: the control allocation ($k = 0$) involves a term of the form $[\,\cdot\, - n_{0,j}]^{\eta_j}$ and, as in Trippa et al. (2012), $\gamma_j = 10 (j b / T)^{0.75}$ and $\eta_j = 0.25 (j b / T)$.
Optimal allocation ratios: the rule is defined through quantities $n^*_k$ that solve an optimization problem. For both NA and RSIHRA, we considered $B = 0.1$ and $C = 0.80$ and defined $\pi_{k,j}$ using an allocation function controlled by a tuning parameter $\gamma$ and a smoothing parameter $\sigma$ ($\gamma = 2$, $\sigma = 1$).
Modified controlled FLGI: defined in terms of the approximate FLGI probabilities in (3).

To assess the chance of a type I error being made under the global null, we report the family-wise error rate $\bar{\alpha}$. This is the probability of rejecting at least one true null hypothesis. The Bonferroni method was used to account for multiple testing and ensure that $\bar{\alpha} \le \alpha$, i.e., all hypotheses whose p-values $p_k$ satisfy $p_k \le \alpha / K$ are rejected (with $\alpha = 0.05$). To assess power, we calculate the probability of rejecting the null for the truly best treatment under scenario (ii), that is, the probability of rejecting $H_{01}: p_0 = p_1$ when $p_1 = 0.458$.
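The Bonferroni rule can be illustrated with a plain Fisher exact test computed from the hypergeometric distribution. This sketch makes several assumptions that should be read as such: the test is one-sided and unadjusted (the calibrated cut-off used in the simulations is not reproduced), the function names are hypothetical, and the counts are back-calculated from the reported NeoSphere response rates.

```python
from math import comb


def fisher_one_sided(s1, n1, s0, n0):
    """P-value of H0: p1 = p0 against p1 > p0, given s1/n1 (experimental) and s0/n0 (control)."""
    total = s1 + s0
    denom = comb(n1 + n0, total)
    # Upper tail of the hypergeometric distribution of successes in the experimental arm.
    tail = sum(comb(n1, x) * comb(n0, total - x) for x in range(s1, min(n1, total) + 1))
    return tail / denom


def bonferroni_reject(p_values, alpha=0.05):
    """Reject H_0k whenever its p-value is at most alpha / K."""
    K = len(p_values)
    return [p <= alpha / K for p in p_values]


control = (31, 107)                      # (successes, patients), illustrative counts
arms = [(49, 107), (18, 107), (23, 96)]  # experimental arms 1, 2, 3
p_values = [fisher_one_sided(s, n, *control) for s, n in arms]
print([round(p, 4) for p in p_values])
print(bonferroni_reject(p_values))
```

With three experimental arms, each comparison against control is carried out at level $0.05 / 3$, so only a clear difference such as that of arm 1 leads to rejection.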
Hypothesis testing was performed using a normal cut-off value (when appropriate) and using an adjusted version of Fisher's exact test for comparing two binomial distributions. Fisher's test is conservative, i.e., its actual rejection rate is far below the nominal significance level. To make the designs comparable, we chose its cut-off value so as to achieve the nominal type I error rate (Villar et al., 2015).
Under the global null, $p^*$ is defined as the mean proportion of patients assigned to the control group, whereas under the alternative, $p^*$ is defined as the mean proportion assigned to treatment 1. We also report the ENS value if all patients ... Table 3 displays the results from 5000 replications of the trial. As expected, under scenario (i) (top of Table 3) all the designs are equal in terms of ENS and $p^*$ (ENS is $T \times 0.29 = 120.93$). All rules allocate on average the same proportion of patients to each treatment (close to 0.25). The exception to this is the Trippa procedure, for which control group allocation is always higher because it is matched to that of the best performing (by chance) treatment. Thompson sampling and the Gittins index-based designs have a variability in the allocation probabilities 4 to 8 times larger than the other designs. The most interesting differences among these designs occur under scenario (ii). The value of ENS attained under fixed randomization in this case is close to the value reported in the study (121 complete remissions). We find that the Gittins index, FLGI, and CFLGI procedures increase the number of successes dramatically compared to fixed randomization (achieving 60, 58, and 45 more remissions on average, respectively). The Trippa procedure also improves on fixed randomization (by approximately 29 remissions on average) and it achieves a higher power than the CFLGI, yet the remissions attained are 10% lower.

Results
Of particular interest is the learn-versus-earn trade-off. The first four designs (fixed randomization, Trippa procedure, CFLGI, and Thompson sampling) achieve more than 60% power to reject $H_{01}$, whereas the last four designs (RSIHRA, Neyman allocation, FLGI, and the Gittins index) fall below this power level. However, the designs that achieve the highest patient benefit (i.e., $p^* \ge 0.80$ and ENS $\ge 175$) are the ones that have the lowest power. The designs that have higher ENS and $p^*$ also have a larger variability of allocation probabilities.
The optimal allocation ratio designs (Neyman allocation and RSIHRA) are constrained to attain at least a given probability to reject the global null for a test of homogeneity $H_0$ (i.e., to reject at least one $H_{0k}$ with a two-sided test). In scenario (ii), they are required to attain the same level as fixed randomization to do that (i.e., $C = 0.80$ in Table 2). In fact, both designs produce levels substantially higher than that (0.99 and 0.97, respectively). To attain this, both optimal designs, after a minimal requirement of experimentation on all arms (i.e., $B = 0.1 \times b$), skew allocation toward the best (i.e., $k = 1$) and worst arm (i.e., $k = 2$) to reject $H_{12}$. Because of this, their probability of rejecting $H_{02}$ is lower than that of fixed randomization, as the increased probability to reject the global null $H_0$ is gained by increasing the chance of rejecting $H_{12}$ instead of $H_{02}$.
For multi-arm trials, power depends on the particular alternative hypothesis considered. In NeoSphere, the worst and best treatments were different from the control arm and this explains the reduced marginal power of Neyman allocation and RSIHRA to reject $H_{02}$. It also explains the high marginal power attained by CFLGI, as its allocation quickly tends to assigning one in every four patients to control and the remaining three to the best experimental arm, which in NeoSphere was $k = 1$. To illustrate this point, we consider two more cases where we compare the CFLGI to the optimal designs that target an ethical criterion and power simultaneously: Scenario (iii), where we let the control arm be the best treatment observed in NeoSphere, and Scenario (iv), where control was assumed to be the worst treatment in NeoSphere.
These results suggest that the power-ENS balance of the CFLGI is maintained except for the case in which the best arm is the control arm. In general, we expect the CFLGI to have a high marginal power to detect a difference between the best experimental treatment, if it exists, and control. In Web Appendix D, we discuss these extra scenarios and present a modification of the CFLGI rule that also performs well in the case where control is the best treatment.

Varying Block Size
We next evaluate the effect of changing the block size (or the number of interim analyses) on the ENS-power relation of the allocation methods in which power is not constrained. By definition, when $b = 1$ the FLGI is identical to the fully sequential Gittins index rule (i.e., it favors ENS at the expense of power). Also, from Proposition 1 (see Web Appendix B) we know that for $b = T$ (and $J = 1$) and all arms with identical initial priors, the FLGI group design is simply a fixed randomized design (i.e., it favors power at the expense of ENS). Hence, by varying $b$ between 1 and $T$ one can see how the FLGI approach trades off power for ENS within this range. This is illustrated in Figure 2. Note that the trial size is increased from 417 to 450 to allow several values of $b$ to divide it exactly.
For the FLGI design, as the size of the block increases, its ENS decreases and its power increases. Conversely, for the other designs (CFLGI and Trippa procedure), the balance between power and ENS remains approximately constant when changing the block size. Block sizes with at least 80% power have ENS values below 180, while block sizes with ENS values above 180 have below 80% power. The designs that protect the allocation to control have a very similar performance in their power-ENS trade-off across different block sizes. For every block size, CFLGI achieves a higher ENS but a lower power than the Trippa procedure.

The FLGI and the Patient Population Size
Cheng and Berry (2007) argue that the optimality of a particular design should be judged by taking into account the size of the trial ($T$), the total patient population ($N$, or the "patient horizon") and, crucially, the ability of the trial's allocation rule to facilitate identification of the truly best treatment at its conclusion while treating patients in the trial as effectively as possible. We now assess, via simulation, the designs considered from this conceptual viewpoint by considering the total expected number of successes ($\mathrm{ENS}_N$) as the sum of the first $T$ patients in the trial ($\mathrm{ENS}_T$) and the remaining $N - T$ outside of the trial ($\mathrm{ENS}_{N-T}$):

$$\mathrm{ENS}_N = \mathrm{ENS}_T + \mathrm{ENS}_{N-T}. \qquad (4)$$
We explore two rules to choose the 'best' treatment (denoted by $k^*$) at the end of the trial. Rule 1 is as follows: if there is no treatment with a statistically significant effect compared to the control group, then the control group is chosen ($k^* = 0$). However, if one or more treatments are significantly effective, then the treatment with the largest posterior mean by the end of the trial is chosen. $\mathrm{ENS}_N$ in equation (4) can then be evaluated for each allocation rule by calculating $\mathrm{ENS}_T$ as described earlier in Section 4 and using

$$\mathrm{ENS}_{N-T} = (N - T) \sum_{k=0}^{K} p_k \Pr(k^* = k), \qquad (5)$$

where $\Pr(k^* = k)$ is the probability that treatment $k$ is selected at the end of the trial. Rule 2 simply chooses the treatment with the highest Gittins index by the end of the trial, so that

$$k^* = \arg\max_{k} G(x_{k,T}). \qquad (6)$$

The probabilities of selecting a treatment as the best treatment under both rules are computed via simulation after $10^3$ replicas of the trial of size $T$. Let $K = 3$ and the true response rate vector be the observed values in NeoSphere, i.e., $(p_0, p_1, p_2, p_3) = (0.29, 0.458, 0.168, 0.24)$. We consider two possible trial sizes: a small trial ($T = 80$) to represent a rare disease and a large trial (the original size of $T = 417$) to represent a more common condition. Furthermore, we vary the population size as $N = (1 + l)T$ for $l \in \{1, 2, 3, \ldots, 10, 20, 40, 60, 80, 100\}$. This allows, in the rare disease case, for the total population to range from 160 (equal numbers inside and outside of the trial) to 8080 (a hundred times as many outside as inside). We assess the relative gap between the upper bound on $\mathrm{ENS}_N$ (if all patients had been assigned to the best treatment from the start) and the value calculated from equation (4) (for a given trial allocation and treatment selection rule) using

$$\frac{N \max_k p_k - \mathrm{ENS}_N}{N \max_k p_k}. \qquad (7)$$

Figure 3 (top) shows the relative gap (7) as a function of $N$ under rule 1 for the small trial (top-left) and large trial (top-right).
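The patient-horizon calculation can be sketched numerically as follows. The code assumes that the expected successes outside the trial are $(N - T) \sum_k p_k \Pr(k^* = k)$ and that the relative gap is normalized by the upper bound $N \max_k p_k$; both forms, the function names, and the example inputs (an in-trial ENS under equal allocation and a selection rule that always picks the best arm) are illustrative assumptions rather than values from the study.

```python
def ens_outside(p, selection_probs, N, T):
    """Expected successes among the N - T patients treated after the trial."""
    return (N - T) * sum(pk * qk for pk, qk in zip(p, selection_probs))


def relative_gap(ens_T, p, selection_probs, N, T):
    """Relative shortfall of total expected successes against treating everyone with the best arm."""
    upper = N * max(p)
    ens_N = ens_T + ens_outside(p, selection_probs, N, T)
    return (upper - ens_N) / upper


p = [0.29, 0.458, 0.168, 0.24]
T = 80
ens_T = T * sum(p) / len(p)     # in-trial ENS under equal allocation (illustrative)
always_best = [0.0, 1.0, 0.0, 0.0]

for l in (1, 10, 100):
    N = (1 + l) * T
    print(N, round(relative_gap(ens_T, p, always_best, N, T), 4))
```

When the best arm is always selected, the gap is driven entirely by in-trial losses and shrinks as $N$ grows; a rule that often selects the wrong arm keeps the gap bounded away from zero however large the population.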
These results indicate that all rules, except for the Gittins index and FLGI, exhibit a type of asymptotic convergence similar to the one reported in Cheng and Berry (2007). That is, as the population that benefits from the trial grows (i.e., as N − T → ∞), their suboptimality gap tends to zero. However, for FLGI and the Gittins index approach the opposite is true. For smaller population sizes the gap is smallest, and it grows as a function of N. Moreover, the gap of the Gittins index and the FLGI rules is always above the other designs.
The poor performance of the uncontrolled Gittins index-based rules is caused by the use of a classical significance test. Such tests are appropriate when large enough numbers of patients are assigned to all arms and are thus perfectly suited to rules such as fixed randomization. If a truly best treatment exists, the Gittins index-based rules will assign highly unequal numbers across trial arms. Although this imbalance is a positive indicator of a superior treatment effect, it (paradoxically) leads to a low power. For this reason, Figure 3 (bottom) shows how the relative gap of the Gittins index, FLGI, and CFLGI decreases toward 0 under selection rule 2 for both trial sizes. The results for the other allocation methods using selection rule 1 are shown for comparison. Under rule 2, the convergence of the gap toward 0 as $T$ and $N - T$ tend to infinity holds and, moreover, the advantage of these rules over the other designs holds for all $N$.
These results illustrate that, for rare diseases (where there is little power to detect a statistically significant treatment effect) bandit-based designs are highly attractive. Furthermore, the value of such designs lies within the fact that they can be effectively used for treatment selection (i.e., using rule 2). Note that CFLGI performs well under either selection rule.

Discussion
Despite its optimality and its original clinical motivation, the Gittins index rule has never been used in practice to conduct an actual clinical trial. In this article, we focus on three of the practical barriers to its use, namely: (1) insufficient statistical power (when computed using traditional hypothesis testing procedures); (2) the need for instantly observed treatment outcomes; and (3) a lack of randomization to provide a basis for inference. To address these barriers, we propose the forward looking Gittins index algorithm, a randomized group sequential allocation procedure based on the index solution to the classic infinite horizon multi-arm bandit problem.
Simulations confirm that the FLGI procedure, combined with a protected control group allocation, enables adaptively block-randomized clinical trials that are statistically conservative when no treatment offers additional benefit over control, statistically powerful when one does, and highly ethical in terms of patient benefit. These characteristics were found to persist when the block size was stochastic instead of fixed (results not shown). As with all adaptive allocation procedures, the feasibility of our approach is diminished when there are significant delays in observing patient responses.
The FLGI rule, although independently developed, resembles the policy improvement algorithm (see, e.g., Howard, 1960) used to approximate computationally expensive optimal value functions in Markov decision problems. Semi-randomized rules based on the Gittins index have been proposed by Glazebrook (1980). These rules overcome the power limitation by continuing to allocate patients to suboptimal arms with a positive, though decreasing, probability. However, they are not expressed in terms of allocation probabilities, and thus do not constitute a fully randomized procedure in the sense that we have outlined here. Gittins indices and analogous optimality results have been derived for a variety of endpoints. Therefore, a natural extension of this work would be to consider other endpoints, e.g., exponentially distributed populations (see Gittins et al., 2011). Extending the FLGI rule to the case in which covariates are available, as in Yang and Zhu (2002), is also very relevant.

[Figure caption, beginning missing: ... (21, 41, 61, 81, 101)T} in 10^3 replicas. Top: simulations using power computations based on statistical significance. Bottom: simulations using the GI rule to make a final decision for FLGI, CFLGI, and GI.]
Finally, the FLGI rule is especially useful for trials in which the main goal is the selection of a superior treatment for further study (for example, in Phase II or rare disease settings). This is because the FLGI, like the Gittins index, skews the allocation toward one of the treatments, selecting the best treatment within the trial when it exists, although it is often then unable to declare it effective at the desired significance level. This highlights the need to develop hypothesis testing procedures that complement bandit designs. High power may be less relevant than the goal of having a large probability of selecting a superior treatment within the trial and therefore treating patients better. If power is a concern, the controlled FLGI rule can be applied to gain both power and ethical advantages.
Adaptive designs are often criticized in terms of their frequentist operating characteristics. We have found that the FLGI approach controls the type I error conservatively and that the success rates of the suboptimal arms (those discarded within the trial by the FLGI rule) are slightly underestimated (Villar et al., 2015). Further research is needed to develop testing and estimation procedures that maintain the correct frequentist operating characteristics.
Our proposed algorithm was motivated by the clinical trial setting, but its use is not limited to medical applications. The heuristic applies to sequential allocation problems more generally. In particular, the FLGI can be used as a heuristic to solve bandit problems with more than two actions (see Web Appendices C and E) or with random block sizes.

Supplementary Web Appendix
Web Appendices A-D and additional Tables referenced in Sections 2, 3, 3.1, 4.2, 4.3, and 5 are available with this article at the Biometrics website on Wiley Online Library. Example code and instructions for running it are also available as a web supplement (same website).