Estimation of treatment effects following a sequential trial of multiple treatments

When a clinical trial is subject to a series of interim analyses as a result of which the study may be terminated or modified, final frequentist analyses need to take account of the design used. Failure to do so may result in overstated levels of significance, biased effect estimates and confidence intervals with inadequate coverage probabilities. A wide variety of valid methods of frequentist analysis have been devised for sequential designs comparing a single experimental treatment with a single control treatment. It is less clear how to perform the final analysis of a sequential or adaptive design applied in a more complex setting, for example, to determine which treatment or set of treatments amongst several candidates should be recommended. This article has been motivated by consideration of a trial in which four treatments for sepsis are to be compared, with interim analyses allowing the dropping of treatments or termination of the trial to declare a single winner or to conclude that there is little difference between the treatments that remain. The approach taken is based on the method of Rao‐Blackwellization which enhances the accuracy of unbiased estimates available from the first interim analysis by taking their conditional expectations given final sufficient statistics. Analytic approaches to determine such expectations are difficult and specific to the details of the design: instead “reverse simulations” are conducted to construct replicate realizations of the first interim analysis from the final test statistics. The method also provides approximate confidence intervals for the differences between treatments.

worse than any other according to predetermined criteria is eliminated from the trial. If only one treatment remains, it is declared the winner and the trial stops. If all remaining treatments are sufficiently similar to one another, they are declared joint winners and the trial stops. Otherwise the trial continues to the next interim analysis. At the end of such a trial, how should the differences between the performances of the treatments be estimated, bearing in mind the potential biases introduced by the elimination and stopping rules? 2

Estimation following a sequential trial has a rich statistical literature. For the case of trials comparing two treatments, methods may be based on orderings of the final sample space 3-6 or on direct adjustment of the maximum likelihood estimate to reduce its bias. 7 Overviews of alternative approaches are available. 8,9 For adaptive designs, which are not entirely predefined, estimation methods can be devised by extending the ordering approach 10,11 or through shrinkage approaches. 12,13 In this article, an approach based on Rao-Blackwellization 14,15 will be developed: a method that has been adopted by previous authors for certain specific designs. 16,17

It is difficult to generalize approaches based on orderings of the final sample space to trials of multiple treatments because there are so many possible final samples, and it becomes unclear which of them provide stronger evidence favoring a given alternative over the null than the sample observed. Direct adjustment of maximum likelihood estimates depends on knowledge of the distribution of the final sample statistics around the stopping boundary. While this can be characterized for a comparison of two treatments that relies on a single test statistic, it is far more challenging to achieve for multiple treatments compared in respect of several pairwise test statistics. Here the Rao-Blackwellization approach will be developed.
This is based on the expected value of an unbiased estimate computed at the first interim analysis (and thus unaffected by any stopping rules), conditional on sufficient statistics computed at the end of the study. Rather than finding this estimate and its SE analytically, they will be determined by reverse simulation. That is, starting with the final values of the numbers of patients and the numbers of successes for each treatment (and, when present, within each stratum), hypergeometric sampling will be used to create possible samples at each earlier interim analysis until that at the first interim analysis has been recreated. Only those sequences that are consistent with continuation to the observed end of the sequential procedure are accepted. The mean and variance of the unbiased estimates from each acceptable replicate of the first interim analysis are then used to provide unbiased estimates and approximate confidence intervals that allow for the sequential nature of the design.
The approach developed has the potential for implementation following a wide range of multiple treatment trials and flexible adaptive designs. It is often easier to work backwards from the end of the trial and determine which sequences of data would have led to continuation to the final sample, than to project such sequences from the outset. Much of the development and evaluation of the method will be made in the context of a conventional sequential comparison of just two treatments because that setting is simpler analytically and computationally, and because it allows comparisons to be made with more established methods.
In the next section the trial design described in Magaret et al 1 is reviewed. Instead of using the elimination and stopping rules proposed in that article, the performance of alternative rules based on the triangular test 8 is examined. It must be emphasized that this design serves here only as an illustration of the new estimation approach. The latter could be applied to a wide range of multiple treatment designs and indeed other forms of flexible adaptive design. In Section 3, the simple comparative triangular design that forms the basis of the four treatment evaluation is extracted and examined in isolation, and in Section 4, naïve and orderings-based analyses are developed in the two-treatment context, together with two forms of the Rao-Blackwellization approach. The new method is applied to simulated data from the four treatment design in Section 5, and to simulated data from a simpler design for a smaller trial in Section 6. Section 7 draws together conclusions from this work.

A DESIGN FOR THE COMPARISON OF FOUR TREATMENTS
The design introduced by Magaret et al 1 comprised up to four successive analyses based on constant nominal α-levels.
Here the overall structure of that design is retained, but a different approach is taken to the elimination and stopping rules: one which will allow the extraction of a simple triangular test 8 for examination in the central sections of this article. The design examined in this article was accepted for use by the investigators and was written into the protocol. For reasons unconnected with the trial design or any other statistical considerations, the trial did not actually run. Treatment effects are expressed in terms of odds ratios for success, with the value 1.5 taken to be of clinical importance. The probability of success (survival to 28 days) for a patient receiving treatment T_i is denoted by p_i, i = 1, …, 4, and the log-odds ratio for treatment T_i relative to T_j by θ_ij. The design seeks to satisfy the following requirements. Type I error requirement: for any treatment T_i, if there is another treatment T_j, i ≠ j, such that p_i = p_j, then the probability that the trial finds T_i to be the sole winner is to be ≤0.025. Power requirement: for any pair of treatments T_i and T_j, if T_i is superior to T_j to the extent that θ_ij = log(1.5), then the probability that T_j would be eliminated from the study is to be ≥0.90.
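As a quick numerical check of the effect scale (an illustrative Python snippet, not part of the original design specification):

```python
import math

def log_odds_ratio(p_i: float, p_j: float) -> float:
    """Log-odds ratio of success for treatment T_i relative to T_j."""
    return math.log((p_i / (1 - p_i)) / (p_j / (1 - p_j)))

# Both pairs of success rates used later in the article correspond to the
# clinically important odds ratio of 1.5, i.e. log(1.5) ≈ 0.405:
effect_a = log_odds_ratio(0.60, 0.50)  # p_i = 0.60 vs p_j = 0.50
effect_b = log_odds_ratio(0.50, 0.40)  # p_i = 0.50 vs p_j = 0.40
```
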
Interim analyses occur whenever 36 new patient outcomes become available on each of the treatments remaining in the study. The maximum sample size is set at 2772. Patient responses and interim analyses continue until the trial stopping rules are satisfied or else it is impossible to assign 36 more patients to all remaining treatments within this quota. The probability that not all treatment comparisons will be resolved after 2772 patient responses have been observed is small. If all four treatments were to remain in the trial, the maximum number of interim analyses would be about 20: more could occur if treatments were eliminated.
At the kth interim analysis, every pair of treatments T_i and T_j will be compared in terms of the statistics Z_ijk and V_ijk, where

Z_ijk = (n_jk S_ik − n_ik S_jk)/(n_ik + n_jk) and V_ijk = n_ik n_jk (S_ik + S_jk)(n_ik + n_jk − S_ik − S_jk)/(n_ik + n_jk)³. (1)

Here n_ik denotes the number of patient responses available for patients on Treatment T_i, and S_ik the number of those who have survived to Day 28. For a stratified version of the procedure, the statistics shown are computed separately within each stratum, and then summed over strata to provide the values of Z_ijk and V_ijk to be used. Notice that alternative and equivalent formulations for these two statistics are

Z_ijk = (S_ik/n_ik − S_jk/n_jk) × n_ik n_jk/(n_ik + n_jk) and V_ijk = p̄_ijk (1 − p̄_ijk) × n_ik n_jk/(n_ik + n_jk),

where p̄_ijk = (S_ik + S_jk)/(n_ik + n_jk). The first shows that Z_ijk is the difference between the proportions of successes on treatments T_i and T_j, multiplied by a factor that increases as sample size increases and is equal to ¼(n_ik + n_jk) when n_ik = n_jk. The second shows that V_ijk is the product of the overall success rate and the overall failure rate on the two treatments multiplied by that same factor. It can be shown that V_ijk always increases between interim analyses.
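The two equivalent formulations described above can be sketched in a few lines (Python for illustration; the article's own computations use SAS):

```python
def z_statistic(S_i: int, n_i: int, S_j: int, n_j: int) -> float:
    # Efficient-score statistic: difference in success proportions
    # times the factor n_i * n_j / (n_i + n_j).
    return (n_j * S_i - n_i * S_j) / (n_i + n_j)

def v_statistic(S_i: int, n_i: int, S_j: int, n_j: int) -> float:
    # Information statistic: pooled success rate times pooled failure
    # rate times the same factor n_i * n_j / (n_i + n_j).
    n, S = n_i + n_j, S_i + S_j
    return n_i * n_j * S * (n - S) / n**3
```

With 36 responses per arm (as in the design), 20 successes against 15 gives Z = 2.5, matching the "difference in proportions times (n_i + n_j)/4" reading of the first formulation.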
At the kth interim analysis, it is concluded that T i is better than T j if Z ijk ≥ 10.90266 + 0.12380V ijk , no different from T j if Z ijk ∈ (10.90266 − 0.37140V ijk , −10.90266 + 0.37140V ijk ), and worse than T j if Z ijk ≤ −10.90266 − 0.12380V ijk . If the interval used to judge no difference is empty because the left-hand limit is larger than the right-hand limit, then the no difference conclusion is not possible. Whenever one treatment is found to be worse than another according to this criterion, that treatment is eliminated from the trial. Randomization continues between the remaining treatments, and interim analyses continue to take place whenever 36 new outcomes have become available for each remaining treatment. The trial stops when only one treatment remains, or when all remaining treatments are found to be no different from one another. For the purposes of the simulations conducted here, the trial also stops if a further interim analysis would require the total number of patients to exceed 2772, although in practice investigators might choose an alternative strategy as discussed later in this section.
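The pairwise decision rule just stated can be sketched as follows (an illustrative Python snippet; the constants are the boundary intercept 10.90266 and slope 0.12380 quoted above, the inner wedge using three times that slope, that is 0.37140):

```python
A = 10.90266   # boundary intercept
C = 0.12380    # slope of the outer boundaries; inner wedge uses 3 * C

def pairwise_decision(z: float, v: float) -> str:
    """Conclusion for T_i versus T_j at one interim analysis."""
    if z >= A + C * v:
        return "better"
    if z <= -A - C * v:
        return "worse"
    # "No difference" wedge (A - 3Cv, -A + 3Cv): empty until v > A / (3C),
    # in which case the comparison simply continues.
    lo, hi = A - 3 * C * v, -A + 3 * C * v
    if lo < z < hi:
        return "no difference"
    return "continue"
```

Note that for v below A/(3C) ≈ 29.4 the wedge is empty, which is why a "no difference" conclusion is impossible at the earliest interim analyses.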
The elimination and stopping rules, as they relate to a comparison between one pair of treatments, are shown in Figure 1. Each interim analysis is governed by the discs shown on the boundaries: at the kth interim analysis the value of Z_ijk is plotted against that of V_ijk, and the conclusion indicated is drawn. The design has been developed from a double triangular design devised to compare two experimental treatments. 8,18 The boundaries are computed to satisfy the type I error and power requirements mentioned above, interpreted for the simple case of two treatments. Computation is based on the SEQ function of SAS, following References 19 and 20 but using the four-boundary option of SEQ. The increment in information between interim analyses for this double triangular test is ΔV = 4.40337. When p_1 = 0.40 and p_2 = 0.50, or when p_1 = 0.50 and p_2 = 0.60 (both corresponding to an odds ratio of 1.5), this corresponds to an increase in sample size between interim analyses of 35.58 per treatment, which is rounded up to 36 in this application.
Applied to the case of four treatments, the type I error and power requirements specified at the beginning of this section are valid. The probability that T_1 is declared the sole winner, when in fact p_1 = p_2, is greatest when the success rates on T_3 and T_4 are both zero, so that there is no chance of them being declared either sole or joint winners; there would also be a negligible chance of their being declared no different from T_1, from T_2, or from both. In this circumstance, the probability that T_1 would be declared the sole winner is therefore equal to the probability of T_1 being found better than T_2 in the double triangular test when θ_12 = 0: that is, 0.025. Furthermore, the probability that T_1 is eliminated when θ_21 = log(1.5) is least when the success rates on T_3 and T_4 are both zero, so that there is no chance that T_1 would be eliminated relative to them. In this circumstance, the probability that T_1 would be eliminated is therefore equal to the probability of T_2 being found better than T_1 in the double triangular test when θ_21 = log(1.5): that is, 0.900.

FIGURE 1 The elimination and stopping rule for a single pair of treatments
Properties of the design, estimated from million-fold simulations, are shown in Table 1. In each of Cases 1 to 12, one set of treatments shares a high success rate and the rest share a low rate, with the odds ratio between the two rates being 1.5. In Cases 13 to 16, all success rates are equal. Also shown are "Mixed Cases": for these, we imagine that the trial is conducted at four centers, each recruiting equal numbers of patients. In the simulations, the 36 patients recruited to each treatment for each new interim analysis are distributed amongst the centers at random. The four centers in the mixed cases each have a different set of success probabilities, namely the four sets shown in the cases above. In the simulations for the mixed cases, the statistics Z and V given in (1) are stratified by center: that is, the four within-center values of Z and V are calculated and then summed to provide the values to be compared with the stopping boundaries.
In Cases 1 to 4 and Mixed Case I, the probability that T 4 is correctly eliminated exceeds 0.90, as specified in the power requirement. This is true for T 2 and T 3 as well, although these results are not shown: in general the full results reflect the symmetry of each scenario. Treatment T 1 is correctly selected with a probability exceeding 0.80: this is a desirable feature, although not part of the formal specification. In Cases 5 to 8 and Mixed Case II, the probability of wrongly declaring T 1 to be the winner is no more than 0.026, (essentially) satisfying the type I error requirement. The probability of eliminating T 4 is well above the value of 0.90 of the power requirement. The probability of correctly declaring T 1 and T 2 to be joint winners is above 0.90, except for Case 8 where it is 0.885. In Cases 9 to 12 and Mixed Case III, the probability that T 1 wins is 0.005 and the probability that T 4 is eliminated is greater than 0.975. The probability of correctly identifying the three joint winners is greater than 0.814. Finally, in Cases 13 to 16 and Mixed Case IV, the probability that T 1 wins is 0.002 or less in all cases. The probability of correctly identifying all four treatments as no different is greater than 0.748, except for case 16 where it is 0.591.
Average total sample sizes at termination are around 1400 to 2400. Sample sizes are smaller when success probabilities are close to 1 ∕ 2 , and larger when they are close to 1 or to 0. They are also smaller when there is a single treatment that is more efficacious than the others, or when there are two good treatments. Cases where three or all four treatments are equally efficacious require larger sample sizes before a conclusion is reached. Ethically, this is sound, as if all treatments are the same, no group of patients is being disadvantaged by being in the trial. The full results show that sample sizes on poor treatments tend to be small and those on good treatments to be large, indicating the effectiveness of eliminating poor treatments. The percentage of inconclusive trials (trials where after 2772 patients uncertainty between at least one pair of treatments remains) was 26.6% in Case 16. In all other cases, such percentages are small or negligible. If the trial ends without either identifying a single winner or concluding that there is no difference between the remaining treatments, then investigators can accept the result available, or else recruit additional patients to force a conclusion.
The construction of the decision rules of the design guarantees that it is not possible to declare two treatments to be no different from one another during the first six interim analyses (see Figure 1). It is possible to stop at any analysis to conclude that one of the treatments is better than all of the others, but the evidence has to be very strong. Minimal evidence for T_2 to be eliminated relative to T_1 at the first interim analysis requires 23 successes out of 36 on T_1 and none on T_2: the corresponding one-sided nominal P-value lies well below 0.00001. In fact, under most realistic scenarios, the probability of stopping at one of the first three interim analyses is negligible. It can be seen that the procedure presented here achieves the type I error and power requirements specified, and has other desirable properties in terms of high probabilities of appropriate conclusions and relatively low expected sample sizes. It must be stressed that these interim analyses are very simple to carry out. The following information on all patients randomized 28 days ago or earlier is all that is needed: patient identification number; treatment center and any other baseline stratification factors; date of randomization; treatment arm (T_1, T_2, T_3, or T_4); and survival to Day 28 (YES or NO). All but the last are available for a month before the patient is to be included in interim analyses. More extensive reviews of the data might be planned, perhaps to coincide with every fourth or every fifth interim analysis. Interim analyses are to be conducted whenever the average number of patient responses per remaining treatment collected since the previous interim analysis reaches 36. Ideally, this should be 36 patients per treatment, but the formulae given at (1) can be used when sample sizes are unequal, and the accuracy will remain good provided that sample sizes per treatment are approximately equal.
The method is also likely to be forgiving of slight slippage from an average of exactly 36 new patients per treatment.

CONVENTIONAL POST-TRIAL ESTIMATION FOR A SIMPLE TRIANGULAR TEST
Now consider a comparison between just two treatments, T_1 and T_2. A series of up to 20 interim analyses are conducted, at the kth of which the statistics Z_12k and V_12k defined in (1) will be computed. Here, they will be denoted simply by Z_k and V_k, and the log-odds ratio θ_12 by θ. The trial will be stopped with the conclusion that T_1 is better than T_2 if Z_k ≥ 10.93898 + 0.123134V_k, or with the conclusion that T_1 is no better than T_2 if Z_k ≤ −10.93898 + 0.369402V_k. The design is constructed using published code, 19,20 with the risk of one-sided type I error set to 0.025 and the power for an odds ratio of 1.5 set to 0.90. Note that the boundaries differ slightly from those used in the four treatment case, because the latter were based on the properties of pairwise double triangular tests. Here, T_1 can be thought of as the experimental treatment and T_2 as the control: the design is asymmetric in dealing with the treatments. The maximum value of V is V_20 = 88.8380, at which point the stopping boundaries meet; hence the planned increment per interim analysis is V_1 = 88.8380/20 = 4.4419. For p_1 = 0.60 and p_2 = 0.50, so that θ = log(1.5), the total sample size per interim analysis is approximately 72 (36 per treatment). In simulations reported here, additional interim analyses are conducted beyond the 20 initially planned, up to a maximum of 25, if increments in V fall short of the anticipated value of 4.4419 and no boundary has been reached. In practice, if increments in V are observed to be low, then sample sizes per interim analysis can be increased.

Table 2 shows the results of 12 simulated realizations of this triangular design, ordered by increasing strength of evidence that T_1 is better than T_2. Also given are the results of a naïve analysis in which the sequential nature of the design is ignored, and of a valid analysis based on the ordering of Fairbanks and Madsen. 21 For the naïve analysis, the estimated value of θ is taken to be θ̂ = Z*/V* with SE = 1/√V*, and the corresponding 95% confidence interval (θ̂_L, θ̂_U) is (θ̂ ± 1.96 SE), where Z* and V* are the values of Z and V found from the final dataset. The orderings analysis is computed following References 19 and 20. In each computation the value of V_i is taken to be equal to i × V_20/20. In practice the true values of the V_i would be used, but the approximation is adopted here for simplicity, and to allow readers to check the computation of the estimates. The analysis methods developed in the next section do not depend on the intermediate values of the V_i. The bias-adjusted estimate 7 has no corresponding accurate method for computing confidence intervals, and for that reason it is not explored here.
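For reference, the naïve computation is trivially reproduced (a Python sketch for illustration; the article's computations use SAS):

```python
import math

def naive_analysis(z_star: float, v_star: float):
    """Naive analysis ignoring the sequential design: theta_hat = Z*/V*,
    SE = 1/sqrt(V*), and 95% confidence interval theta_hat +/- 1.96 SE."""
    theta_hat = z_star / v_star
    se = 1.0 / math.sqrt(v_star)
    return theta_hat, (theta_hat - 1.96 * se, theta_hat + 1.96 * se)
```
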
The orderings analysis provides valid P-values, and it reduces estimates of θ when the upper boundary is crossed and increases them when the lower boundary is crossed. It provides entirely satisfactory results based on the actual sequential design used. However, it is difficult to see how it might be generalized for use following a sequential comparison of more than two treatments.

Note to Table 2: Terminal values of the number of interim analyses, total sample size, the numbers of successes on T_1 and T_2, and of the statistics Z and V are shown as int*, n*, S_1*, S_2*, Z*, and V*, respectively. Patients are evenly divided between the two treatments, so that n_1* = n_2* = ½n*. b* denotes the boundary crossed, with 0 denoting the lower boundary and 1 the upper boundary. For the naïve analysis, the estimated value of θ is Z*/V*, with 95% confidence interval (θ̂_L, θ̂_U) = (θ̂ ± 1.96/√V*). The orderings analysis is based on the ordering of Fairbanks and Madsen 21 and computed following References 19 and 20.

POST-TRIAL ESTIMATION BASED ON RAO-BLACKWELLIZATION FOR A SIMPLE TRIANGULAR TEST
The Rao-Blackwellization approach 14,15 is based on the estimate θ̂_1 = Z_1/V_1 deduced from the data available at the first interim analysis, which is unbiased for θ as it does not depend on the stopping rule in any way. Consequently, the estimate θ̃ = E(θ̂_1 | Z*, V*) is also unbiased for θ and has smaller variance. The statistics (Z*, V*) are jointly sufficient for θ, as will be demonstrated in Section 4.1 below. They are not complete, so that it cannot be claimed that θ̃ is the minimum variance unbiased estimate. However, a less universal statement can be made, as follows. The estimate θ̃ is truncation-adaptable, meaning that it depends only on the form of the interim analyses that were performed and not on those that were planned to take place but did not. (Orderings analyses are also truncation-adaptable, but the bias-adjusted method 7 is not.) The estimator θ̃ achieves minimum variance within the class of unbiased truncation-adaptable estimators. 22

Now E(θ̃) = θ, and var(θ̃) = var(θ̂_1) − E{var(θ̂_1 | Z*, V*)}. In order to compute confidence intervals, it will be assumed that the pivot {θ̃ − E(θ̃)}/√var(θ̃) follows a standard normal distribution and that E{var(θ̂_1 | Z*, V*)} can be reliably estimated by var(θ̂_1 | Z*, V*). Thus, the SE of θ̃ is given by

SE(θ̃) = √{1/V_1 − var(θ̂_1 | Z*, V*)}, (2)

and an approximate 95% confidence interval for θ is (θ̃ ± 1.96 SE(θ̃)). It is unlikely that either of the assumptions on which this approach is based is more than approximately true. The accuracy of the derived confidence intervals should be evaluated by simulation for any given application. The theoretical basis for the unbiasedness of the estimate θ̃ is far stronger than that for the accuracy of the confidence interval. Two methods for evaluating θ̃ and SE(θ̃) will now be developed. The first, Method RB1, is an analytical approach depending on known properties of the triangular test. It is infeasible to generalize RB1 to the four treatment case, and it is included here for comparison and checking.
Method RB2 employs reverse simulation to recreate replicate observations of Z 1 and V 1 , and is applicable in complicated situations such as a comparison of four treatments.
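The unbiasedness and variance reduction invoked above follow from the tower property and the standard Rao-Blackwell variance decomposition; in outline:

```latex
% Tower property: the conditional estimate inherits unbiasedness
E(\tilde{\theta}) = E\{E(\hat{\theta}_1 \mid Z^*, V^*)\} = E(\hat{\theta}_1) = \theta.
% Variance decomposition: conditioning on the sufficient statistics
% removes the expected conditional variance
\operatorname{var}(\hat{\theta}_1)
  = \operatorname{var}\{E(\hat{\theta}_1 \mid Z^*, V^*)\}
  + E\{\operatorname{var}(\hat{\theta}_1 \mid Z^*, V^*)\}
\;\Longrightarrow\;
\operatorname{var}(\tilde{\theta})
  = \operatorname{var}(\hat{\theta}_1)
  - E\{\operatorname{var}(\hat{\theta}_1 \mid Z^*, V^*)\}
  \le \operatorname{var}(\hat{\theta}_1).
```

This is standard theory rather than material specific to the article, but it is the justification for subtracting the conditional variance when forming the SE.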

Method RB1
Denote the lower and upper stopping limits for Z_k at the kth interim analysis by ℓ_k and u_k, respectively, k = 1, 2, …. The sequential design based on the first n of these interim analyses, which is then truncated, is denoted by R_n. The interim analysis at which the design R_n actually stops will be denoted by K_[n], and the corresponding final values of the test statistics by Z_[n] and V_[n]. Equation 5.38 of Reference 8 defines the function f_[n](z, k, θ), the sub-density of stopping at the kth interim analysis with test statistic equal to z. The sequence of functions f_[n](z, k, θ) for z < ℓ_k or z > u_k, k = 1, …, n − 1, together with f_[n](z, n, θ), provides a density for the final position of the test statistics (Z_[n], V_[n]) over all of their possible final values. Using eq. 5.41 of Reference 8, it can be shown that, for any θ,

f_[n](z_n, n, θ) = exp(θz_n − ½θ²V_n) f_[n](z_n, n, 0), (4)

as pointed out by Emerson and Kittelson. 15 Now, let F_[n](z_n, n, θ) denote the probability that the design R_n stops at the nth interim analysis with Z_[n] ≤ z_n. In fact, this is the probability that any design which shares with R_n the stopping limits for its first n interim analyses stops at the nth interim analysis with Z_[n] ≤ z_n. This function can be evaluated using the SAS function SEQ. Note that the probability that the design R_n stops at the nth interim analysis with Z_[n] ∈ (z_n − δz, z_n + δz) is given by F_[n](z_n + δz, n, θ) − F_[n](z_n − δz, n, θ).
Next, consider the adjusted sequential test in which the first lower limit is amended to be ℓ_1 + t, t ∈ (0, u_1 − ℓ_1). The functions corresponding to f_[n] and F_[n] for this design will be denoted by f^(t)_[n] and F^(t)_[n], respectively. Then the probability that the sequential design R_n starts with a value of Z_1 lying above ℓ_1 + t and then later stops at the nth interim analysis with Z_[n] ∈ (z_n − δz, z_n + δz) is given by F^(t)_[n](z_n + δz, n, θ) − F^(t)_[n](z_n − δz, n, θ). Denote by S(t; θ) the conditional probability that Z_1 lies above ℓ_1 + t, given that the trial stops at the nth interim analysis with Z_[n] ∈ (z_n − δz, z_n + δz): S(t; θ) = P_θ(Z_1 > ℓ_1 + t | K_[n] = n, Z_[n] ∈ (z_n − δz, z_n + δz)). It follows, using (4), that

S(t) = {F^(t)_[n](z_n + δz, n, 0) − F^(t)_[n](z_n − δz, n, 0)} / {F_[n](z_n + δz, n, 0) − F_[n](z_n − δz, n, 0)}, (5)

where the value of θ is suppressed in the notation for S, as the right-hand side does not depend on it. This confirms the sufficiency of the statistics Z_[n] and K_[n]. It can be shown that

E(Z_1 | K_[n] = n, Z_[n] = z_n) = ℓ_1 + ∫₀^{u_1−ℓ_1} S(t) dt and E{(Z_1 − ℓ_1)² | K_[n] = n, Z_[n] = z_n} = 2 ∫₀^{u_1−ℓ_1} t S(t) dt. (6)

Suppose that a trial with stopping limits at the first n interim analyses of (ℓ_1, u_1), …, (ℓ_n, u_n) stops with Z_n = z_n, so that K_[n] = n and Z_[n] = z_n. The value of F_[n](z_n + δz, n, 0) − F_[n](z_n − δz, n, 0) can be evaluated using the SAS function SEQ for the trial stopping limits, with the modification that the nth continuation region is (z_n − δz, z_n + δz). The value of δz is chosen to be small, but large enough for the resulting nth continuation probability to be reported with a reasonable number of decimal places. Then F^(t)_[n](z_n + δz, n, 0) − F^(t)_[n](z_n − δz, n, 0) is evaluated in a similar way, but for a design with first continuation region (ℓ_1 + t, u_1), for a grid of values of t between 0 and u_1 − ℓ_1. This allows S(t) to be found from (5) for the same grid of values, from which the conditional mean and SD of Z_1 given K_[n] = n and Z_[n] = z_n can be found from (6) using numerical integration. Note that the function SEQ is constructed for stopping limits for Z_n/√V_1: this necessitates some intricate programming in order to obtain the correct answers.
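Given values of the survivor function S(t) on a grid, the final numerical-integration step can be sketched as follows (illustrative Python; the article performs this step in SAS). It relies on the standard tail-integral identities for a random variable confined to (ℓ_1, u_1): E(Z_1) = ℓ_1 + ∫S(t)dt and E{(Z_1 − ℓ_1)²} = 2∫t S(t)dt.

```python
import numpy as np

def _trapezoid(y: np.ndarray, x: np.ndarray) -> float:
    """Composite trapezoidal rule (kept explicit for portability)."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

def conditional_mean_sd_z1(ell1: float, u1: float, S_grid) -> tuple:
    """Conditional mean and SD of Z_1 given the final stopping information,
    from S(t) = P(Z_1 > ell1 + t | ...) evaluated on an equally spaced
    grid of t over [0, u1 - ell1]."""
    S_grid = np.asarray(S_grid, dtype=float)
    t = np.linspace(0.0, u1 - ell1, len(S_grid))
    shifted_mean = _trapezoid(S_grid, t)              # E(Z_1) - ell1
    shifted_msq = 2.0 * _trapezoid(t * S_grid, t)     # E{(Z_1 - ell1)^2}
    var = shifted_msq - shifted_mean**2
    return ell1 + shifted_mean, float(np.sqrt(var))
```

A sanity check: if Z_1 were uniform on (ℓ_1, u_1), S(t) is linear and the function recovers the uniform mean and SD.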

Method RB2
For reverse simulation, the estimate θ̃ = E(θ̂_1 | S*, n*) is used, where S* and n* are the vectors of numbers of successes and numbers of patients, by treatment, in the final dataset. The final interim analysis will be taken to be the Kth. The number of successes on T_i at the kth interim analysis, S_ik, is simulated as a hypergeometric observation: the number of successes in a draw of n_ik patient responses from a total of n_{i,k+1} responses, of which S_{i,k+1} are successes, i = 1, 2; k = K − 1, K − 2, …, 1. For each replicate simulation, the estimate θ̂_1 = Z_1/V_1 is found from (1) using the simulated numbers of successes on the two treatments at the first interim analysis. All values of Z_k and V_k, k = 1, …, K − 1, are checked, and any simulated sample path that corresponds to a trial that would have stopped according to the sequential design prior to the Kth interim analysis is deleted from the set of simulated runs. The mean and variance of the remaining values of θ̂_1 are then used as θ̃ and var(θ̂_1 | S*, n*), respectively; the latter is used in place of var(θ̂_1 | Z*, V*) in (2) to provide a value for SE(θ̃). The set of simulated realizations of the first interim analysis can be used in a similar way to provide unbiased estimates of the success probabilities p_1 and p_2, allowing for the sequential nature of the trial.

Table 3 presents results from analyses of the 12 cases presented in Table 2 using the Rao-Blackwellization methods. For Method RB1, the value of δz in (5) was set at 0.01, and a grid of 100 points was used to evaluate the integrals in (6). For Method RB2, 10 million replicate reverse simulations were generated, and the first column in the RB2 section of the table shows that between 17.0% and 99.3% of the replicates were complete: that is, they corresponded to sequential trials that would not have stopped prior to the observed final interim analysis.
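The reverse-simulation loop just described can be sketched as follows (Python used here for illustration; the article's own implementation is in SAS, and the stopping-rule check is passed in as a user-supplied function, which is an assumption of this sketch):

```python
import numpy as np

def rb2_reverse_simulation(n, S, n_runs, would_have_stopped, seed=0):
    """Sketch of Method RB2 for two treatments.

    n[i][k], S[i][k]: cumulative patients and successes on treatment i
    (i = 0, 1) at interim analysis k (k = 0, ..., K-1) in the final data.
    would_have_stopped(z, v): the design's stopping check at one interim.
    Returns the accepted replicate values of theta_hat_1 = Z_1 / V_1.
    """
    rng = np.random.default_rng(seed)
    K = len(n[0])
    estimates = []

    def z_and_v(k, sim_S):
        n0, n1 = n[0][k], n[1][k]
        s0, s1 = sim_S[0][k], sim_S[1][k]
        z = (n1 * s0 - n0 * s1) / (n0 + n1)
        tot_n, tot_s = n0 + n1, s0 + s1
        v = n0 * n1 * tot_s * (tot_n - tot_s) / tot_n**3
        return z, v

    for _ in range(n_runs):
        sim_S = [list(S[0]), list(S[1])]
        ok = True
        for k in range(K - 2, -1, -1):
            for i in (0, 1):
                # Successes at interim k: hypergeometric draw of n[i][k]
                # responses from the n[i][k+1] seen at interim k+1, of
                # which sim_S[i][k+1] were successes.
                good = sim_S[i][k + 1]
                bad = n[i][k + 1] - good
                sim_S[i][k] = int(rng.hypergeometric(good, bad, n[i][k]))
            # Discard paths the design would have stopped before interim K.
            z, v = z_and_v(k, sim_S)
            if would_have_stopped(z, v):
                ok = False
                break
        if ok:
            z1, v1 = z_and_v(0, sim_S)
            if v1 > 0:
                estimates.append(z1 / v1)
    return estimates
```

The mean of the returned values plays the role of θ̃, and their variance is the conditional variance subtracted in the SE computation.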
Figure 2 shows the estimates and confidence intervals from the Rao-Blackwellization approaches and the same quantities from the naïve and orderings analyses, plotted against the values of the naïve estimates. The value of the naïve estimate is subtracted from all quantities in order to provide a clearer view of the differences between the methods.

FIGURE 2 Estimates and 95% confidence limits for θ from the Rao-Blackwellization approaches, the orderings analysis, and the naïve approach (with the naïve estimate subtracted) plotted against the naïve estimate for Cases 1 to 12

The 12 cases are ordered with respect to the naïve estimates, and so Cases 1 to 12 are the points indicated by crosses running from left to right. The vertical line at θ = 0.2462 represents the value of treatment effect at which the trend of the plot of Z against V would head for the tip of the triangle, for this is the average of the boundary slopes. When the naïve estimate lies below 0.2462, both adjustments increase the magnitude of the estimate, with those due to the Rao-Blackwell estimate being greater. When the naïve estimate lies above 0.2462, both adjustments reduce the magnitude of the estimate, with those due to the Rao-Blackwell estimate again being greater. Adjustments using Method RB1 are a little more extreme than those using RB2.
The naïve confidence limits are narrowest and will fail to meet the target coverage probability. Method RB1 leads to the widest intervals, followed closely by RB2. When the naïve estimate lies below 0.2462, adjusted limits lie above the corresponding naïve limits, and when it lies above 0.2462 they lie below them. This effect is greatest for the Rao-Blackwell approaches. In cases in which there is a large overshoot of the boundary at the final interim analysis (Cases 6, 8, 9, and 11), the adjustments for sequential analysis have the greatest effect on the estimate of θ; in cases where the overshoot is small (Cases 5, 10, and 12), they have less effect. The SAS programs leading to the RB1 and RB2 analyses shown in Table 3 are provided as supplementary material of this article.

TABLE 4 Evaluation of the naïve and the Rao-Blackwellization methods based on 1000-fold simulations

Table 4 presents the results of 1000-fold simulations of the naïve approach and of Methods RB1 and RB2 for three true values of θ: the null value, 0; the alternative value log(1.5) = 0.405; and, between these, the value 0.246, which is the average of the two boundary slopes. In each case the control success probability was set at p_C = 0.6. The results from the naïve approach confirm that there is a problem to be addressed. In particular, when θ = 0.405 the effect of the treatment is systematically overestimated, and in all three cases the coverage probability of the confidence intervals is inadequate. For Method RB1, the grid is again constructed of 100 points and the value of δz set at 0.01. For Method RB2, for reasons of computing time, 1 million replicate reverse simulations were used rather than the 10 million that underlay the results presented in Table 3. The Rao-Blackwellized estimate θ̃ is presented with its SD computed from the 1000 replicate values and its SE, which is the mean of the values computed from (2).
These two values are close to one another, supporting the basis of the computation. For both methods, the bias of θ̂ is small. The coverage probabilities of the 95% confidence intervals are around 0.970 (and significantly greater than 0.95 at the one-sided 2.5% level) in each case. They could therefore be used in practice as conservative interval estimates.
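The SE construction referred to here rests on the standard Rao-Blackwell variance decomposition (the law of total variance). The following is a sketch of the underlying identity as general background; it is not necessarily the exact form of the article's Equation (2):

```latex
% Law of total variance applied to the unbiased first-interim estimate:
\operatorname{Var}(\hat\theta_1)
  = \operatorname{Var}\bigl(E[\hat\theta_1 \mid S^*, n^*]\bigr)
  + E\bigl[\operatorname{Var}(\hat\theta_1 \mid S^*, n^*)\bigr].
% Writing \tilde\theta = E[\hat\theta_1 \mid S^*, n^*] for the
% Rao-Blackwellized estimate, an SE estimate takes the form
\operatorname{SE}(\tilde\theta)^2
  \approx \widehat{\operatorname{Var}}(\hat\theta_1)
  - \widehat{\operatorname{var}}(\hat\theta_1 \mid S^*, n^*).
```

With the conditional variance estimated from the reverse simulations, the difference on the right-hand side can be negative in small samples, which is what can make such a square root fail to exist.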

APPLICATION TO THE COMPARISON OF FOUR TREATMENTS
A single set of simulated data consistent with the design proposed in Section 1 is used to illustrate the implementation of Rao-Blackwellization in a more complicated situation. Table 5 displays the data from this single realization. This summary is sufficient for analysis according to Method RB2. There are six pairwise treatment comparisons to consider. Table 6 presents the final values of the test statistics Z and V for each of these comparisons. Treatment T2 was eliminated in comparison with T1 at the fourth interim analysis, and T4 followed at the fifth. This left T1 and T3, which continued to be monitored until the 12th interim analysis, at which point T1 was found to be the winner. Whether or not the analysis is conducted allowing for the sequential design used, two options are available for the final analysis. Option 1 is to use all data available on each treatment in making each comparison. Option 2 is to restrict the data used in any pairwise comparison to those collected from patients randomized when both treatments were still in contention. This is the form of analysis reflected in the values of Z and V displayed in Table 6. It avoids biases that may be caused by any temporal effects on the nature of the patients recruited, on the manner in which treatments were administered, or on how observations were recorded. Option 2 will be adopted here.
To implement Option 2, three separate reverse simulations have to be performed. To compute the estimate θ̂13 and its SE, reverse simulation is conducted from the 12th interim analysis, at which T1 was found to be better than T3, leading to the termination of the whole trial. From Table 5, it can be seen that at the 12th interim analysis at Centre 1, T1 had been administered to 103 patients with 83 successes and T3 to 111 patients with 85 successes. At the 11th interim analysis at Centre 1, T1 had been administered to 98 patients and T3 to 102 patients. For the reverse simulation, the number of successes on T1 at Centre 1 is generated as a hypergeometric random variable: the number of successes among 98 patients drawn randomly from the 103, of which a total of 83 are successes. The number of successes on T3 at Centre 1 is generated similarly, as are the success counts for the other centers. These success counts are then used to generate the numbers of successes on the two treatments at the 10th interim analysis, and so on back to the first interim analysis. In the reverse simulation, the numbers of patients and of successes on T4 at the fifth interim analysis are taken to be as recorded in Table 5, and the numbers of successes at earlier interim analyses are filled in by hypergeometric simulation; for T2 the reverse simulation begins at the fourth interim analysis.

[Table 5 (extract; remainder truncated): Raw data from a single simulation of the four treatment design, giving for each treatment and center the cumulative sample size and number of successes at each interim analysis. Treatment 1 at interim 12: Centre 1, n = 103, S = 83, sample sizes 11, 18, 30, 41, 50, 57, 65, 76, 86, 92, 98, 103, successes 10, 17, 27, 35, 41, 46, 53, 63, 69, 74, 78, 83; Centre 2, n = 100, S = 67, sample sizes 10, 16, 25, 33, 41, 49, 60, 71, 82, 88, 96, 100, successes 10, 14, 20, 25, 30, 34, 40, 47, 58, 61, 65, 67; Centre 3, n = 104, S = 64, sample sizes 7, 17, 25, 35, 44, 55, 63, 68, 72, 83, 90, 104, successes 6, 11, 16, 20, 26, 32, 36, 41, 43, 49, 55, 64; Centre 4, n = 125, S = 68, sample sizes 8, 21, 28, 35, 45, 55, 64, 73, 84, 97, 112, …]
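The backward hypergeometric step described above can be sketched as follows. This is a minimal illustration in Python for a single treatment at a single center; the function name and the use of NumPy are my own (the article's supplementary programs are in SAS):

```python
import numpy as np

def reverse_path(n, s_final, rng):
    """Reverse-simulate one cumulative success path for one treatment at one center.

    n:       cumulative sample sizes at interim analyses 1..K (nondecreasing).
    s_final: observed cumulative successes at the final interim analysis K.

    Working backwards, the successes among the first n[k-1] patients form a
    hypergeometric draw: n[k-1] patients sampled without replacement from the
    n[k] patients seen by interim k, of whom s[k] were successes.
    """
    K = len(n)
    s = [0] * K
    s[-1] = s_final
    for k in range(K - 1, 0, -1):
        s[k - 1] = int(rng.hypergeometric(ngood=s[k], nbad=n[k] - s[k], nsample=n[k - 1]))
    return s

rng = np.random.default_rng(2024)
# Cumulative sample sizes for T1 at Centre 1 (Table 5); 83 successes at interim 12.
n_t1c1 = [11, 18, 30, 41, 50, 57, 65, 76, 86, 92, 98, 103]
path = reverse_path(n_t1c1, 83, rng)
```

The hypergeometric support automatically keeps each simulated path monotone and consistent with the per-stage sample sizes, which is why no further constraint checking is needed at this step.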
In the example, no treatment was eliminated at the first interim analysis. If one of them had been, then the Rao-Blackwellization process would amount to taking the estimates of the log-odds ratios featuring that treatment directly from the first interim analysis. The next step is to determine which of the reverse simulated runs are consistent with the outcome of the trial, and to delete those which are not. For each reverse simulated run, every remaining treatment comparison is considered at each interim analysis in turn. The relevant stratified values of Z and V can be computed from the simulated success counts. Consider the comparison between treatments Ti and Tj, i ≠ j, i, j = 1, 2, 3, 4. First, consider an interim analysis which in the real trial is the last for both Ti and Tj. In such a case the reverse simulated data for both treatments will be identical to those used in the actual trial, and the conclusions will be the same. No runs will be deleted on the basis of these data. Now consider an interim analysis which in the real trial is the last for Ti, but after which Tj continued to be observed. If, in the real trial, Tj was found better than Ti at this interim analysis, then any reverse simulation for which this did not occur is deleted. It is possible that in the real trial Tj was not found better than Ti at this interim analysis, Ti being eliminated in comparison with another treatment. In this case, any reverse simulation in which Ti was found to be better than, or worse than, Tj is deleted. Furthermore, any reverse simulation run that ends at this interim analysis with the conclusion that there is no difference between any of the remaining treatments will be deleted.
Finally, consider an interim analysis which in the real trial is not the last for either Ti or Tj. Any reverse simulation for which Tj was found better than, or worse than, Ti at this interim analysis is deleted. Once more, any reverse simulation run that ends at this interim analysis with the conclusion that there is no difference between any of the remaining treatments will be deleted. For each of the reverse simulation runs that remains after the deletion process, the estimate θ̂13 = Z131/V131 is found from (1) using the reverse simulated stratified test statistics for the comparison of T1 and T3 from the first interim analysis. The mean of the values of θ̂13 provides the RB2 estimate, and the corresponding variance provides var(θ̂13 | S*, n*). The latter is used in a suitably amended version of Equation (2) to provide a value for SE(θ̂13).
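The final averaging step over the runs that survive the deletion process can be sketched as below. This is a schematic in Python; the array names and the pre-computed consistency mask are hypothetical stand-ins for the per-interim deletion rules just described:

```python
import numpy as np

def rb2_estimate(z1, v1, consistent):
    """RB2 point estimate for one treatment pair.

    z1, v1:     per-run stratified Z and V statistics for the pair at the
                first interim analysis, one entry per reverse-simulated run.
    consistent: boolean mask; True for runs whose simulated interim decisions
                reproduced those of the actual trial (ie, surviving deletion).

    Returns the RB2 estimate (the mean of theta = Z/V over surviving runs)
    and the between-run variance of theta, which feeds the SE computation.
    """
    theta = z1[consistent] / v1[consistent]
    return theta.mean(), theta.var(ddof=1)

# Toy illustration with synthetic reverse-simulation output.
rng = np.random.default_rng(7)
z1 = rng.normal(5.0, 1.0, size=1000)
v1 = np.full(1000, 20.0)
keep = z1 < 6.0                      # stand-in for the deletion process
theta_hat, cond_var = rb2_estimate(z1, v1, keep)
```

The selective mask is the essential point: averaging over all runs, rather than only the consistent ones, would not condition on the observed trial outcome and would lose the Rao-Blackwell property.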
A second reverse simulation is then run, starting at the fifth interim analysis and using the actual numbers of successes on T1, T3, and T4 at each center at that analysis as the starting point for each reverse simulation. Following the deletion of runs that would have been incomplete, θ̂14 and θ̂34 and their corresponding SEs are found. The third reverse simulation starts at the fourth interim analysis and uses the actual numbers of successes observed on all treatments at each center at that analysis as the starting point for each reverse simulation. This provides the estimates θ̂12, θ̂23, and θ̂24 and their corresponding SEs.
In the results that follow, one modification of the method implemented in the unstratified case is made. For the purposes of computing the θ̂ij and their SEs only, Vijc1 is replaced by V′ijc1, as defined in Equation (7), where the additional subscript c indicates the center, c = 1, …, 4. The usual expression for Vijc1 is used during the conduct of the trial and when assessing whether simulated trial runs are complete. However, it is the version of θ̂ij1 computed with V′ijc1 that is averaged over complete simulated runs to provide θ̂ij and used to determine SE(θ̂ij). The reason for this change is pragmatic: without it, estimates show excessive bias and SEs are too small or sometimes nonexistent, as Equation (2) involves the square root of a negative value. Use of V′ijc1 largely avoids these problems. In the unstratified case the sample sizes per treatment at the first interim analysis are quite large, and so this level of attention to detail is unnecessary. In the stratified case, it is the sample sizes within center that determine the accuracy of the procedure, and without the use of (7) these are too small to guarantee the accuracy of the estimates, or the existence of the SEs.

[Table 7: Analyses of the data from the single simulated run of the sequential four treatment comparison shown in Tables 5 and 6. In the naïve analyses, the sequential nature of the trial is ignored. The Rao-Blackwellization method, RB2, is based on 10 million replicate reverse simulations; the proportion of these that were complete is shown in the sixth column.]

Table 7 compares a naïve analysis, in which pairs of treatments are compared using the data available at the last interim analysis in which both were present but the sequential nature of the trial is ignored, with the RB2 method described above. The number of reverse simulations was set at 10 million. It can be seen that the effect of allowing for the sequential design is to reduce the magnitude of the estimates of the advantage of T1 over each of T2 and T4, while the estimate of the advantage of T1 over T3 is hardly changed. The corresponding confidence intervals are all widened. The other estimates of treatment effects have also been reduced in magnitude, but the effect on their SEs is less marked. The SAS program leading to the RB2 analyses shown in Table 7 is provided as supplementary material of this article. Table 8 shows the results from 1000 replicate simulations of a situation in which T1 is the best treatment. To achieve a feasible computational time, 1 million reverse simulations are used in each analysis.
Furthermore, for ease of computation, Option 1 is chosen so that a single set of reverse simulations will yield estimates and confidence intervals for all treatment comparisons. For comparison, the results from naïve analyses based on the test statistics Z and V comparing the final samples simulated from each treatment (ie, using Option 1) are also shown. In 103 of the 1000 replicate simulations, fewer than 1000 of the million reverse simulations led to sample paths that were consistent with the outcome of the trial and thus survived the deletion process described above. For the purpose of this investigation, the results from these runs are considered unreliable and are omitted. In practice the number of reverse simulations would be raised to 10 million or beyond to yield sufficient consistent reverse simulations, this being feasible for single analyses but not for 1000 as required here.
The comparisons of T 1 with the three rival treatments each lead to overestimation of treatment effect when the naïve analysis is used, whereas the estimates drawn from RB2 show much smaller biases. The results for the other comparisons are similar for the two approaches, with RB2 being a little less biased. In most of the simulated realizations, the timing of these comparisons will have been determined by the completion of others, and so the effects of the sequential design would be expected to be less marked. The coverage probabilities for confidence intervals based on the naïve approach are too low, while those for RB2 are conservative, but satisfactory. Other simulations were conducted in which each RB2 analysis depended on only 100 000 reverse simulations. These led to less accurate estimation and markedly conservative confidence intervals. It appears that, provided that sufficient reverse simulations are used, RB2 leads to accurate analyses that overcome the potential bias inherent in the use of data-dependent elimination and stopping rules.

APPLICATION IN A SIMPLER SETTING
The application for which the trial design was developed concerned a large study that would have recruited over a lengthy time period, and which would have benefited from the large number of interim analyses planned. It is of interest to explore what would happen were the method implemented within a smaller and simpler study. For this purpose, a fictitious example is considered. The illustrative trial concerns a comparison of four treatments yielding binary observations. Stratification is not allowed for. The Type I and Type II error requirements are that the probability that a treatment is wrongly found to be the sole winner should be ≤0.025, and that the worse of two treatments separated by an odds ratio of 2.25 should be eliminated with probability ≥0.90. Interim analyses are to be conducted after 32 observations per remaining treatment. The double triangular design for two treatments satisfying this specification is used for the four treatment comparison, in the same way as described in Section 2. For the upper triangle of the continuation region, the lower and upper boundaries are: Z = −4.9261 + 0.7411V and Z = 4.9261 + 0.2470V.
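The boundary check performed at each interim analysis, using the intercepts and slopes just quoted, can be sketched as follows. This is an illustrative Python helper (my own naming); it reports only which boundary of the upper triangle has been reached, leaving the mapping to "winner"/"eliminated"/"no difference" decisions to the surrounding design logic:

```python
def upper_triangle_decision(z, v):
    """Check the statistics (Z, V) for one pairwise comparison against the
    upper triangle of the double triangular design specified in the text:
    lower boundary Z = -4.9261 + 0.7411*V, upper boundary Z = 4.9261 + 0.2470*V.

    Returns 'upper' (upper boundary reached), 'lower' (lower boundary
    reached), or 'continue' (still inside the continuation region).
    """
    upper = 4.9261 + 0.2470 * v
    lower = -4.9261 + 0.7411 * v
    if z >= upper:
        return "upper"
    if z <= lower:
        return "lower"
    return "continue"
```

Because the lower boundary has the steeper slope, the two lines meet (at roughly V = 19.9 for these constants), so once the accumulated information passes the tip of the triangle the comparison necessarily resolves.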
Up to eight interim analyses are allowed, but no more than 640 patients in total. Table 9 presents the results of million-fold simulations, confirming that the Type I and Type II error requirements are satisfied. Table 10 contrasts the properties of naïve estimation of treatment effects with those when the RB2 method is used in this setting. The comparative and absolute properties of the RB2 analysis in this simpler setting are similar to those for the motivational example shown in Table 8.

[Table 9 note: win1 = proportion of runs in which T1 wins; elim4 = proportion of runs in which T4 is eliminated; nod = proportion of runs in which, for Cases 1 and 2, T1 and T2 are declared no different from one another; for Case 3, T1, T2, and T3 are declared no different from one another; and for Cases 4 and 5, all treatments are declared no different from one another; still = proportion of runs in which not all treatment comparisons are resolved after 640 responses.]

[Table 10: Evaluation of the naïve method and the Rao-Blackwellization method RB2 for the simpler four treatment design.]

CONCLUSIONS
The approach presented here for estimation following a sequential trial is quite general and can be implemented for a wide variety of designs. In the case of a comparison of a single experimental treatment with a single control arm, the method works and provides satisfactory results, as has been demonstrated in Section 4 above. However, there are already numerous methods of computing point and interval estimates in the two-treatment context. In particular, methods based on orderings of the final sample space are just as good as the approach introduced here for computing point estimates, and more accurate for finding confidence intervals. They are also less computationally demanding. The utility of the approach described here lies in more complicated designs comparing multiple treatments or with flexible adaptive features, as reverse simulation is based only on the form of the stopping rules implemented and not on their theoretical properties. The method has been illustrated and evaluated for one particular form of comparison of four treatments which motivated its development, but its implementation is certainly not restricted to that design.
The claim for the unbiasedness of estimates produced using Method RB2 is underpinned by rigorous asymptotic theory, and the simulation results obtained for their accuracy in Sections 5 and 6 are satisfactory. The method for deriving confidence intervals is less secure, as it depends on two unverified assumptions: that the expected conditional variance of the unbiased estimate at the first interim analysis can be approximated by its observed value from the reverse simulations, and that the adjusted estimate follows the normal distribution. Simulations presented in Sections 4 to 6 demonstrate that the resulting intervals are conservative but usable. It should be repeated that the number of reverse simulations needed to achieve satisfactory results is large. Here, in single demonstration analyses, 10 million replicates were used. In earlier work, we found that using fewer replicates led to less satisfactory results.
Of course, the scenarios that could be investigated by simulation are limitless, and only two have been explored here. In particular, both of the cases considered have involved success probabilities in the region of 1/2. Prior to application in trials where the success probabilities are likely to be close to 0 or to 1, further evaluation might be appropriate.