A response‐adaptive randomization procedure for multi‐armed clinical trials with normally distributed outcomes

Abstract We propose a novel response-adaptive randomization procedure for multi-armed trials with continuous outcomes that are assumed to be normally distributed. Our proposed rule is non-myopic and oriented toward a patient-benefit objective, yet it remains computationally feasible. We derive our response-adaptive algorithm from the Gittins index for the multi-armed bandit problem, as a modification of the method first introduced in Villar et al. (Biometrics, 71, pp. 969-978). The resulting procedure can be implemented under the assumption of either known or unknown variance. We illustrate the proposed procedure by simulations in the context of phase II cancer trials. Our results show that, in a multi-armed setting, there are efficiency and patient-benefit gains from using a response-adaptive allocation procedure with a continuous endpoint instead of a binary one. These gains persist even if an anticipated low rate of missing data due to deaths, dropouts, or complete responses is imputed online through a procedure first introduced in this paper. Additionally, we show that, in the multi-armed trial context, there are response-adaptive designs that outperform the traditional equal-randomization design in terms of both efficiency and patient-benefit measures.

Web Appendix A: Derivation of the FLGI Rule for Normally Distributed Endpoints with a Known Variance
The MABP in this case involves a multi-armed clinical trial that tests the effectiveness of K experimental treatments against a control treatment on a sample of T patients, with K and T fixed and known in advance. Patients are labeled by t (t = 1, ..., T) and treatments by k (k = 0, ..., K), where k = 0 denotes the control. The response of patient t allocated to arm k is a random variable denoted by Y_{k,t} and assumed to follow a normal distribution, Y_{k,t} ~ N(µ_k, σ_k^2). Without loss of generality, we also assume that a larger response is preferable and that σ_k^2 is known.
In order to derive the FLGI rule, we first need to obtain the GI for a normally distributed variable and the MABP associated with this trial design problem. For this purpose, we assume the following. (i) Each unknown parameter µ_k has a prior distribution π_{k,0} at the start of the trial (before any observation has been made), which we take to be the normal prior N(µ_k^0, σ_k^2 / n_k^0). Note that the form of the prior when both µ_k and σ_k^2 are unknown is provided below. (ii) Patients enter the trial one by one and responses are observed immediately after treatment. We will remove these assumptions when we formulate the FLGI rule. (iii) Only one treatment can be allocated per patient, and we let a_{k,t}^r be a binary indicator variable denoting whether or not patient t + 1 is assigned to treatment k under patient allocation rule r, given the information available on all treatments. (iv) Given the conjugacy of the prior and normally distributed responses, prior distributions are converted into normal posterior distributions for each µ_k via Bayes' Theorem. After treating patient t, if n_{k,t} responses from treatment k have been observed (each denoted by y_{k,i} with i = 1, ..., n_{k,t} and n_{k,t} ≤ t), then the posterior distribution of µ_k at time t is

π_{k,t}(µ_k | y_{k,1}, ..., y_{k,n_{k,t}}) ~ N( (n_{k,t} ȳ_{k,t} + n_k^0 µ_k^0) / (n_{k,t} + n_k^0), σ_k^2 / (n_{k,t} + n_k^0) )

by Bayes' Theorem, where ȳ_{k,t} = (1/n_{k,t}) Σ_{i=1}^{n_{k,t}} y_{k,i} is the sample mean and n_k^0 is the implicit sample size from the prior information (Spiegelhalter et al., 2004, p. 62). The posterior distribution, π_{k,t}, can be identified by the parameters ỹ_{k,t} (the posterior mean) and n_k^0 + n_{k,t}, which we subsequently refer to as the state (of the bandit) (Gittins et al., 2011). Note that when the variance is unknown, an additional parameter, s̃_{k,t}^2, denoting the posterior variance at time t on arm k, is required to identify π_{k,t}, and in this case we need to specify a joint prior distribution for µ_k and σ_k^2 at the start of the trial.
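To make the conjugate update in (iv) concrete, here is a minimal Python sketch of the known-variance posterior update; the function name and signature are ours, for illustration only.

```python
def posterior_update(mu0, n0, sigma2, ys):
    """Conjugate normal update with known response variance sigma2.

    Prior: mu ~ N(mu0, sigma2 / n0), with n0 the implicit prior sample
    size.  Returns the posterior mean and posterior variance of mu
    after observing the list of responses ys.
    """
    n = len(ys)
    ybar = sum(ys) / n if n else 0.0
    # Posterior mean is the precision-weighted average of prior mean and
    # sample mean; posterior variance shrinks with total information n + n0.
    post_mean = (n * ybar + n0 * mu0) / (n + n0)
    post_var = sigma2 / (n + n0)
    return post_mean, post_var
```

With the values of the worked example below (prior mean 0, n_k^0 = 1, σ^2 = 1, and responses −0.4 and 3.1 on the control arm), this returns a posterior mean of 0.9 and a posterior variance of 1/3.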
We take this to be the normal-inverse-gamma distribution (where the variance follows an inverse-gamma distribution and the mean, conditional on the variance, has a normal distribution). Consequently, the marginal prior distribution for µ_k has a Student's t-distribution. When we observe an outcome y_{k,t+1} from patient t + 1 on arm k, the state (ỹ_{k,t}, s̃_{k,t}, n_k^0 + n_{k,t}) is updated as follows: the total amount of information increases to n_k^0 + n_{k,t} + 1, the posterior mean becomes ỹ_{k,t+1} = ((n_k^0 + n_{k,t}) ỹ_{k,t} + y_{k,t+1}) / (n_k^0 + n_{k,t} + 1), and the sum of squared deviations underlying s̃_{k,t}^2 increases by (y_{k,t+1} − ỹ_{k,t})(y_{k,t+1} − ỹ_{k,t+1}). The MABP is to find a patient allocation rule r that attains the maximum expected patients' response given the initial information about the treatments before the start of the trial. Mathematically, this is expressed as

max_{r ∈ R} E_r[ Σ_{t=0}^{∞} Σ_{k=0}^{K} d^t Y_{k,t+1} a_{k,t}^r | x_0 ],    (2)

where x_{k,t} = (ȳ_{k,t}, n_{k,t}, µ_k^0, n_k^0), x_0 = {x_{k,0}}_{k=0}^{K} is the initial joint state with all the prior parameters, R is the set of admissible allocation rules, E_r[·] denotes expectation under allocation rule r, and 0 ≤ d < 1 is a discount factor. In MABPs, rewards are geometrically discounted so that an infinite horizon can be considered, i.e. patient t's response yields a reward of d^t Y_{k,t} for some k. In practice, a solution that depends on d, such as the GI, can be adapted to solve an undiscounted problem with a specific finite horizon, as explained in Edwards et al. (2017, Definition 6.6).
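The one-step update of the running posterior mean and the sum of squares driving the posterior variance can be maintained online with a standard Welford-style recursion; the sketch below uses our own variable names (the exact normal-inverse-gamma parameterization is in the main paper).

```python
def update_state(mean, ss, n, y):
    """One-step conjugate state update when a new response y arrives.

    n counts the implicit prior sample size plus observed responses;
    ss accumulates squared deviations about the running mean (the
    quantity that drives the posterior variance when sigma^2 is unknown).
    """
    n_new = n + 1
    mean_new = (n * mean + y) / n_new
    # Welford identity: SS grows by (y - old mean) * (y - new mean)
    ss_new = ss + (y - mean) * (y - mean_new)
    return mean_new, ss_new, n_new
```

Streaming responses through this recursion, starting from a single prior pseudo-observation, reproduces exactly the batch mean and batch sum of squares over all (pseudo-)observations.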
The exact solution to (2), obtained via dynamic programming, uses a backward induction algorithm which becomes computationally infeasible very quickly as T and K grow. The GI solution, first introduced by Gittins and Jones (1979), eliminates this computational infeasibility by ensuring that the optimal solution to (2) can be obtained by simply allocating every patient to the arm with the highest GI. Similarly to equation (1) in the main paper for the unknown variance case, the GIs, G(ỹ_{k,t}, σ_k, n_{k,t}), for the known variance case in (2) can be expressed as

G(ỹ_{k,t}, σ_k, n_{k,t}) = ỹ_{k,t} + σ_k G(0, 1, n_k^0 + n_{k,t}, d),    (3)

where G(0, 1, n_k^0 + n_{k,t}, d) denotes the GI value of a standardized bandit problem with posterior mean 0, standard deviation 1, implicit sample size n_k^0, n_{k,t} observations, and discount factor d (Gittins et al., 2011, Theorem 7.13). These indices were first computed in Jones (1975). Table 1 shows the indices corresponding to the unknown variance case, as used in the main paper, based on those presented in Gittins et al. (2011, Table 8.3).
We implement the solution in (3) at a very low computational cost by calculating the values of G(0, 1, n_k^0 + n_{k,t}, d) in advance and interpolating from the tables printed in Gittins et al. (2011, pp. 261-262). Details on how to compute these indices using value iteration can be found in Gittins et al. (2011, Chapters 7 and 8). Using (3) and the GI rule, we can compute the FLGI probabilities for normally distributed endpoints (with known variance) using equation (3) in Villar et al. (2015). We now assume that, instead of enrolling patients one by one, patients are enrolled in groups of size b over J stages, so that J × b = T. Our response-adaptive rule sequentially randomizes the next b patients among the K + 1 treatments at stage j (j = 1, ..., J), given the data up to and including block j − 1, according to what the GI rule would do.
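The precompute-and-interpolate step can be sketched as follows. The two standardized values in the grid are the ones implied by the example of Web Appendix A (G(0, 1, 1, 0.995) = 1.8175 and G(0, 1, 4, 0.995) = 0.7919); a real implementation would tabulate the full grid from Gittins et al. (2011), and the function names are ours.

```python
def interp(x, xs, ys):
    """Piecewise-linear interpolation (xs strictly increasing), clamped at the ends."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for (x0, y0), (x1, y1) in zip(zip(xs, ys), zip(xs[1:], ys[1:])):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

# Standardized indices G(0, 1, n, d) for d = 0.995, indexed by total
# information n.  Only the two values appearing in the Web Appendix A
# example are reproduced here, for illustration.
N_GRID = [1, 4]
G_GRID = [1.8175, 0.7919]

def gittins_index(post_mean, sigma, n_total):
    """Equation (3): G = posterior mean + sigma * G(0, 1, n_total, d)."""
    return post_mean + sigma * interp(n_total, N_GRID, G_GRID)
```

For instance, an arm with posterior mean 0.675, σ = 1 and total information 4 has index 0.675 + 0.7919 = 1.4669, matching the worked example below.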

Example
We now illustrate the rule's implementation using an example for the case of known variances.
From equation (3), setting d = 0.995 and using Table 8.1 in Gittins et al. (2011), the GI for the control treatment is G_0(0.9, 1, 2) = 0.9 + 0.20137/√(3(1 − 0.995)). Given that the control treatment has the maximum GI, the first patient of the second block (i.e. patient 3) is allocated to the control treatment with probability (w.p.) 1, since there is only one optimal action possible at this point. If we denote the random outcome of this patient by Y_{0,3}, then the updated state for the control treatment is (ỹ_{0,3}, n_0^0 + n_{0,3}) = ((0 − 0.4 + 3.1 + Y_{0,3})/4, 4). Thus, the corresponding index for the control treatment can be expressed as a function of the random outcome from patient three as follows: G_0(ỹ_{0,3}, 1, 3) = (Y_{0,3} + 2.7)/4 + G_0(0, 1, 4, 0.995) = Y_{0,3}/4 + 1.4669.
For the experimental treatment, we have no new information, and so the corresponding index remains unchanged at 1.8175. According to the GI rule, it will be optimal to allocate the control treatment to the second patient of the second block if and only if G_0(ỹ_{0,3}, 1, 3) > G_1(0, 1, 0) = 1.8175, that is, if Y_{0,3} > 1.4024. Since Y_{0,3} ~ N(0.9, 1), we expect this to happen w.p. Pr(Y_{0,3} > 1.4024) = 0.3077. If Y_{0,3} < 1.4024, which happens w.p. 0.6923, then G_0(ỹ_{0,3}, 1, 3) < G_1(0, 1, 0) and the second patient of the second block is optimally allocated to the experimental treatment. Notice that if Y_{0,3} = 1.4024, then there is a tie in the index values and it is equally optimal to allocate either of the two treatments. Although in theory we expect this to happen w.p. 0 (since we are dealing with a continuous distribution), in practice it is possible, and if it were to happen, we would simply randomize w.p. 0.5.
Hence, when using the normal FLGI procedure in this block, the probability of a patient receiving the control treatment is (1 + 1 × Pr(Y_{0,3} > 1.4024))/2 = 0.6538, and the probability of receiving the experimental treatment is (0 + 1 × Pr(Y_{0,3} < 1.4024))/2 = 0.3462. Figure 1 illustrates how the FLGI probabilities for block two, given the data in block one, are computed via a probability tree.
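These block-two probabilities can be reproduced numerically with only the standard library; the helper name below is ours, and the numeric inputs are taken from the example above.

```python
import math

def normal_sf(x, mean, sd):
    """P(Y > x) for Y ~ N(mean, sd^2), via the complementary error function."""
    return 0.5 * math.erfc((x - mean) / (sd * math.sqrt(2.0)))

# Patient 4 is optimally allocated to the control arm iff
# Y_{0,3}/4 + 1.4669 > 1.8175, i.e. iff Y_{0,3} > 4 * (1.8175 - 1.4669).
threshold = 4 * (1.8175 - 1.4669)        # = 1.4024
p_win = normal_sf(threshold, 0.9, 1.0)   # Pr(Y_{0,3} > 1.4024) ~ 0.3077

# Averaging over the two patients in the block (patient 3 is allocated
# to the control arm with probability 1):
p_control = (1 + p_win) / 2              # ~ 0.6538
p_experimental = (1 - p_win) / 2         # ~ 0.3462
```

This is exactly the probability-tree calculation of Figure 1 carried out in code.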

Web Appendix B: Effect of Discount Factor on FLGI Performance
A practical consideration for our design is the choice of discount factor, d. We recommend choosing d to be close to the value obtained from the formula suggested by Wang (1991), namely d = 1 − 1/T, where T is the trial size (for T = 72, this gives d ≈ 0.986). Here, we discuss the implications of not following this recommendation for the performance of the FLGI (with known variance) by presenting results corresponding to d = 0, 0.5 and 0.99 in Table 2. Note that the results for d = 0.995 (the discount factor used throughout the main paper) are shown in (i) of Table 3. When d = 0, the design is analogous to a fully myopic policy which treats every patient as if they were the last one in the trial. In contrast, the closer d is to 1, the greater the influence that potential responses from future participants have on allocation decisions made earlier in the trial; that is, the more "forward looking" the design will be. Thus, we expect the patient benefit measures to increase with d (up to a limit determined by the actual population size), as illustrated in Table 2. In particular, Table 2 shows that as d increases from 0 to 0.995 for b = 1, E(p*) increases by 0.164, which is equivalent to 11 more patients receiving the superior arm, and the relative ETO increases by 17.77%. As a result of the greater imbalance between the treatment arms for larger d, the bias of the treatment effect estimator (under H_1) is also increased.
Interestingly, for smaller d, we observe that the patient benefit measures increase (up to around b = 9) and then decrease. This is due to an interaction between the discount factor and the block size: increasing the block size counteracts the myopic effect of a small d by forcing learning, and consequently improves patient benefit. However, as the block size continues to grow, the effect of the design becoming more balanced supersedes the effect of the discount factor, causing patient benefit to decrease again. Therefore, when choosing d, it is important to consider which block size will be used.
In terms of the power of the design, it increases somewhat with d, as illustrated by Table 2, which shows that the power of the FLGI when d = 0 and b = 1 is 0.213, compared to 0.229 when d = 0.995. This makes sense because increasing d from a value that is much smaller than the recommended one for a fixed T reduces the myopic nature of the rule, meaning that the rule explores the arms more (thus increasing power) and makes better choices (also increasing patient benefit).
The discount factor also affects the variability of the expected allocations, which decreases considerably with the value of d under both H_0 and H_1. For example, Table 2 shows that under H_1, the standard deviation (s.d.) of E(p*) when d = 0 and b = 1 is 0.43, which is 2.7 times larger than the corresponding s.d. when d = 0.995. Given that allocations under index-based designs (and response-adaptive designs more generally) can already be highly variable, it does not make sense to choose a discount factor which exacerbates this even further.
A further practical drawback of using a discount factor that is too small is that it increases the likelihood of the design allocating all patients to only one of the treatments (due to under-exploration). The number of times this occurred out of the 50,000 trial realizations is reported in the "Discarded" column of Table 2. For example, when d = 0 and b = 1, more than half of the 50,000 trial realizations under H_0 (namely 25,621) resulted in this extreme allocation. Therefore, for the purpose of calculating the test statistic (and hence the power) and the bias values in these cases, we instead randomly sampled an observation from the distribution corresponding to the missing arm. In contrast, when d = 0.995, this problem did not occur in any of the 50,000 trial realizations (and similarly when d = 0.99).
Note that all of the aforementioned differences are most pronounced for smaller block sizes (which is when the design is most adaptive) since as the block size grows and the FLGI design becomes more balanced, the respective performance measures eventually converge, irrespective of the value of d.
Overall, provided that d is close to the recommendation of Wang (1991), the performance of the FLGI will be similar, as illustrated by the results for d = 0.99 (Table 2) and d = 0.995 (Table 3). However, choosing d to be too small in relation to T can alter the behaviour of the design significantly. Moreover, if we were to use this design in a rare disease context, where we envisage it would be best suited, d should be chosen large enough that all of the patient outcomes are accounted for in the adaptations, thereby ensuring patient benefit for all.

Web Appendix C: Effect of Prior Information on FLGI Performance
In this Web Appendix, we investigate how sensitive the FLGI is to the choice of prior on the location parameter µ_k when the variance is assumed known. Ultimately, the choice of prior on µ_k determines which GI we start the allocation rule with. The minimum amount of information we can assume a priori in order to initiate the GI policy is n_k^0 = 1 (known variance case) and n_k^0 = 2 (unknown variance case), since the GI is undefined for n_k^0 = 0.
This gives rise to a normal prior with large variance (see Figure 2), which can be used as a so-called 'non-informative' prior (Spiegelhalter et al., 2004, p. 62). All results in the main paper correspond to this 'non-informative' prior, so that we can report the effects on patient response and other relevant statistical properties of the FLGI alone, without the influence of additional prior information. However, we now turn our attention to using different priors in conjunction with the FLGI. We use the results for the 'non-informative' prior (in (i) of Table 3) as a reference, and therefore refer to it as the 'reference' prior from here on.
Taking the two-armed example of Section 3.2 in the main paper (but now assuming known variance), we follow the suggestion provided in Spiegelhalter et al. (2004, Chapter 5) and consider two archetypal priors on µ_1, namely the sceptical and the enthusiastic prior (with the reference prior on µ_0).
The sceptical prior attempts to formalize the belief that large treatment differences are unlikely. In particular, the sceptical normal prior on µ_1 is centered around the null hypothesis value of 0.155 and assigns only a small probability, here 0.05, to the treatment effect exceeding the alternative hypothesis value, that is,

0.155 − z_{0.05} σ_1 / √(n_1^0) = 0.529,    (4)

where n_1^0 is the implicit (prior) sample size and z_{0.05} = −1.645 is the fifth percentile of the standard normal distribution. Rearranging equation (4) gives n_1^0 ≈ 8, equivalent to having eight patients' worth of information (with null mean) available at the start of the trial; that is, approximately 11% of the trial sample size expresses scepticism and shows no treatment difference. The performance measures of our design when starting with this prior on the experimental arm are shown in (ii) of Table 3 for all block sizes, b.
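The implicit sample size of the sceptical prior can be checked with a quick calculation; σ_1 = 0.64 is the known standard deviation from Table 2, and the variable names below are ours.

```python
z05 = -1.645                     # fifth percentile of the standard normal
sigma = 0.64                     # known response standard deviation
null_mean, alt_mean = 0.155, 0.529

# Sceptical prior N(null_mean, sigma^2 / n0) with a 5% chance of the
# effect exceeding the alternative value:
#   null_mean - z05 * sigma / sqrt(n0) = alt_mean.
# Rearranging for the implicit prior sample size n0:
n0 = (z05 * sigma / (null_mean - alt_mean)) ** 2
# n0 is roughly 8: about eight patients' worth of prior information,
# i.e. ~11% of the trial sample size T = 72.
```

Rounding n0 gives the eight 'sceptical' pseudo-responses quoted in the text.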
The enthusiastic prior, on the other hand, is centred on the alternative hypothesis value of 0.529 (with the same variance as the sceptical prior) and specifies that there is little evidence of no treatment effect a priori, i.e. there is a 5% chance of observing a value less than the null mean of 0.155. This corresponds to the normal prior distribution N(0.529, σ_1^2 / n_1^0) with n_1^0 = 8, which is equivalent to having already observed eight 'enthusiastic' responses before the start of the trial. The corresponding results when starting with this prior on the experimental arm are displayed in (iii) of Table 3.

Conclusions
The main conclusion to draw from these experiments is that, when using the FLGI in the known variance case with a sceptical prior on the experimental arm, the power of the design increases whilst the patient benefit measures decrease, relative to the corresponding results when starting with the reference prior. This is what we would expect to observe because the sceptical prior implies that there is a 0.95 probability that µ_1 lies below 0.529 (as depicted in Figure 2), which is incorrect under H_1, and as such it gives the FLGI algorithm a 'false start'. Thus, it takes longer for the design to correctly identify the best arm, resulting in fewer patients allocated to the superior arm but a larger power due to less imbalance.
In contrast, when starting with an enthusiastic prior on the experimental arm, the reverse happens (as shown in (iii) of Table 3): the power decreases whilst the patient benefit measures increase (relative to starting with the reference prior). Again, this is not surprising, because the enthusiastic prior specifies that the most likely value of µ_1 is 0.529 (as illustrated in Figure 2). Under H_1, this is correct, and so it gives the algorithm a 'head start' in the right direction, meaning that it identifies the superior arm more quickly. Thus, fewer allocations are made to the control arm, resulting in more imbalance and hence reduced power. Under H_0, however, this prior specification on the experimental arm is incorrect, and so the FLGI incorrectly allocates fewer patients to the control arm, as observed in (iii) of Table 3 (where the control arm is taken to be the 'superior' arm under H_0). This explains why only ≈ 33% of patients in the trial are allocated to the control arm for all block sizes under H_0. Fewer observations on the control arm lead to an underestimation of µ̂_0, and consequently the treatment effect estimator under H_0 exhibits bias. It is also worth noting that the variability in the allocations decreases when using the enthusiastic prior (relative to the reference prior) since, under H_1, the observed data and the prior information match, which reduces the uncertainty of the allocations.
Overall, our recommendation is to be very cautious when incorporating prior information into bandit-based designs such as the FLGI, because it influences the speed at which the design updates and favors an arm (depending on how informative the prior is). Since these designs are already so dynamic, there is not as much to gain from using prior information as there may be with less responsive designs. If the prior specification is correct, then the incoming data will further enhance the effect of the prior and the design will favor the superior arm sooner, whereas if the prior is misspecified, the bandit may spend more time in the exploration phase or degenerate to allocating all patients to one arm. However, it is likely that the incoming data during the trial will eventually dilute the effect of a misspecified prior. How long the design takes to correct for the misspecification depends on the value of n_k^0: the greater its value, the more influence the prior will have. Therefore, if one wishes to use prior information in conjunction with the FLGI, we suggest setting a small value for n_k^0 (as we have in the main paper).

Figure 2. Sceptical and enthusiastic prior densities, with the reference prior depicted in black. The sceptics' probability that the true mean response is greater than 0.529 (the alternative value) is 0.05, shown by the purple shaded region. The enthusiasts' probability that the true mean response is less than 0.155 (the null value) is also 0.05, shown by the green shaded region. NB By 'reference' prior, we refer to the prior containing the least amount of information needed to initiate the Gittins index policy (i.e. a normal prior with large variance).

Table 1. Gittins indices for a normal reward process with unknown variance, where d and n_k^0 + n_{k,t} denote the discount factor and the total amount of information, respectively. These values are based on those reported in Table 8.3 of Gittins et al. (2011).

Table 2. The effect of altering the discount factor, d, on the performance of the FLGI for a two-armed trial when σ_k^2 = 0.64^2 is assumed known and T = 72, averaged over 50,000 trial replications. NB The "Discarded" column reports the number of trials that resulted in an extreme allocation, with all patients being allocated to only one arm.

Table 3. The effect of using archetypal priors on the performance of the FLGI for a two-armed trial when σ_k^2 = 0.64^2 is assumed known, T = 72 and d = 0.995, averaged over 50,000 trial replications.

Table 4. Comparison of performance measures for a three-armed trial using different designs when the variance is assumed unknown and T = 120, averaged over 50,000 trial replications. Note that the true variance of the response is σ_k^2 = 0.346^2 for k ∈ {0, 1, 2}.