Keywords:

  • Adaptive seamless designs;
  • Confidence intervals;
  • Estimation;
  • Multi-arm multi-stage trials;
  • Treatment selection

Abstract

  1. Top of page
  2. Abstract
  3. 1 Introduction
  4. 2 Confidence interval calculation for the treatment difference after an ASD
  5. 3 A comparison of the confidence intervals using a simulation study
  6. 4 Worked example
  7. 5 Discussion
  8. Acknowledgements
  9. Conflict of interest
  10. References

Recently, in order to accelerate drug development, trials that use adaptive seamless designs such as phase II/III clinical trials have been proposed. Phase II/III clinical trials combine traditional phases II and III into a single trial that is conducted in two stages. Using stage 1 data, an interim analysis is performed to answer phase II objectives and after collection of stage 2 data, a final confirmatory analysis is performed to answer phase III objectives. In this paper we consider phase II/III clinical trials in which, at stage 1, several experimental treatments are compared to a control and the apparently most effective experimental treatment is selected to continue to stage 2. Although these trials are attractive because the confirmatory analysis includes phase II data from stage 1, the inference methods used for trials that compare a single experimental treatment to a control and do not have an interim analysis are no longer appropriate. Several methods for analysing phase II/III clinical trials have been developed. These methods are recent and so there is little literature on extensive comparisons of their characteristics. In this paper we review and compare the various methods available for constructing confidence intervals after phase II/III clinical trials.

1 Introduction

The need to make the drug development process more efficient has led to designs that combine different phases into a single trial that is conducted in two or more stages. Such designs are commonly referred to as adaptive seamless designs (ASDs), because before the final stage one or more interim analyses are performed using data collected at preceding stages to make adaptations such as treatment selection, subpopulation selection or endpoint selection. An example of a trial that could have been conducted using an ASD of the form that we will consider in this paper is reported by Wilkinson and Murray (2001). In the trial, three doses (18 mg/day, 24 mg/day and 36 mg/day) of a new treatment, galantamine, for the alleviation of Alzheimer's disease were compared to placebo. The primary efficacy endpoint was the Alzheimer's Disease Assessment Scale cognitive (ADAS-cog) subscale score (Rosen et al., 1984). Tolerance to the doses was of interest and so adverse events such as vomiting and headache were recorded. Interim analyses were performed at intervals of approximately 20 additional patients in each group. Using interim analysis results, recruitment to an experimental treatment arm could be stopped because of tolerance concerns, overwhelming evidence of superiority over placebo (efficacy) or futility. Three separate group sequential plans for the three doses were used to monitor efficacy and the decision to continue or stop recruiting to a dose did not depend on the results from the other doses. Using the second interim analysis results, recruitment to doses 24 mg/day and 36 mg/day was stopped for efficacy and tolerance concerns, respectively. Dose 18 mg/day was declared efficacious using the third interim analysis results. Thus the trial consisted of more than two stages and more than one experimental dose could continue to the final stage.

In this paper, we are interested in an alternative design, but one with the same objectives, that could have been used to conduct the Wilkinson and Murray (2001) trial. We consider two-stage ASDs where, in stage 1, several experimental treatments are compared to a control and, based on observed stage 1 data, the apparently most promising experimental treatment and the control continue to stage 2. After collection of stage 2 data, a confirmatory analysis that includes the stage 1 data is performed. Such a trial is often termed a seamless phase II/III clinical trial because stage 1 resembles a phase II trial and stage 2 resembles a phase III trial.

The well-established statistical methods for analysing trials that compare a single experimental treatment to control and have a fixed sample size are not appropriate for phase II/III clinical trials. Appropriate methods for analysing phase II/III clinical trials need to account for the comparison of a control to more than one experimental treatment and the inclusion of data used in selecting the apparently most promising experimental treatment in the confirmatory analysis. Several methods for hypothesis testing after a phase II/III clinical trial that control the overall type I error rate have been proposed (Thall et al., 1988; Schaid et al., 1990; Bauer and Köhne, 1994; Hommel, 2001; Stallard and Todd, 2003; Bretz et al., 2006; Koenig et al., 2008). Jennison and Turnbull (2007) have compared the properties of most of these methods. Similarly, methods for point estimation following a phase II/III clinical trial that account for treatment selection and comparison of the control to many experimental treatments to produce approximately unbiased estimates have been proposed (Cohen and Sackrowitz, 1989; Shen, 2001; Stallard and Todd, 2005; Carreras and Brannath, 2013; Kimani et al., 2013). Carreras and Brannath (2013) have carried out an extensive comparison of most of the methods for point estimation.

In this paper, the focus is interval estimation following a phase II/III clinical trial. As with hypothesis testing and point estimation, several methods for constructing confidence intervals after a phase II/III clinical trial have been proposed (Posch et al., 2005; Sampson and Sill, 2005; Stallard and Todd, 2005; Wu et al., 2010; Neal et al., 2011; Magirr et al., 2013). Stallard and Todd (2005), Posch et al. (2005) and Magirr et al. (2013) produce a confidence region for all treatment differences from which simultaneous confidence intervals for all the treatment differences between the experimental treatments and the control may be obtained. In contrast, Sampson and Sill (2005), Wu et al. (2010) and Neal et al. (2011) directly construct a confidence interval for the treatment difference of the selected treatment with the control only. In this paper, we will compare confidence intervals constructed using these various approaches for the treatment difference of the selected treatment to enable all methods to be considered. Sampson and Sill (2005), Stallard and Todd (2005) and Wu et al. (2010) assume the apparently most effective experimental treatment is selected to continue to stage 2, while the methods of Posch et al. (2005), Neal et al. (2011) and Magirr et al. (2013) allow flexible selection of the experimental treatment to continue to stage 2, for example selection that also takes safety data into account. For the comparison, we will consider the setting in which the control and the apparently most effective treatment continue to stage 2. Neal et al. (2011) generalise the results of Wu et al. (2010) to include the case where the selected treatment is not necessarily the most effective. When the apparently most effective treatment is selected, the formula for the confidence interval bounds following Neal et al. (2011) is the same as the formula for the confidence bounds following Wu et al. (2010) and so in our comparison, where we consider only selection of the apparently most effective treatment, it is sufficient to consider only Wu et al. (2010). Also, when the apparently most effective treatment is selected, the confidence intervals following Posch et al. (2005) and Magirr et al. (2013) coincide and so in our comparison, where we consider only selection of the apparently most effective treatment, it is sufficient to consider the method of Posch et al. (2005). We will discuss in Section 'Discussion' the difference between the methods of Posch et al. (2005) and Magirr et al. (2013) and the expected difference between the confidence intervals following these methods for the case where the selected treatment is not the apparently most effective. Finally, because the method of Sampson and Sill (2005) assumes stage 2 data are always observed, we will consider the setting where the trial always continues to stage 2.

The remainder of the paper is organised as follows. In the next section, using a unified notation, we review the four selected methods for constructing confidence intervals after a phase II/III clinical trial. We compare the various confidence intervals using a simulation study in Section 'A comparison of the confidence intervals using a simulation study' and demonstrate how to compute the confidence intervals in Section 'Worked example'. We discuss the findings of Sections 'Confidence interval calculation for the treatment difference after an ASD' to 'Worked example' in Section 'Discussion'.

2 Confidence interval calculation for the treatment difference after an ASD

2.1 Setting and notation

We assume k (inline image) experimental treatments and a control are tested in stage 1. At stage 1, n1 patients are randomly allocated to each experimental treatment and the control. We consider normally distributed outcomes with unknown means and known variances. We assume a common variance σ2 across the inline image treatment groups. We denote the mean from treatment i (inline image), with inline image corresponding to the control, by inline image and the treatment difference inline image (inline image) by inline image. The stage 1 sample mean from treatment group i is denoted by inline image and follows a normal distribution inline image, where inline image. The observed sample means at stage 1 are denoted by inline image. The experimental treatment with the largest inline image and the control continue to stage 2. The selected experimental treatment is random and is denoted by S inline image. At stage 2, n2 patients are randomly allocated to treatment i (inline image). If we denote the stage 2 sample mean for treatment i (inline image) by inline image, then inline image, where inline image. At the end of a phase II/III clinical trial, we are interested in obtaining the two-sided inline image% confidence interval for the treatment difference inline image.

For the control and the selected treatment, S, the proportion of stage 1 data that are used in the interim analysis is inline image and we call this proportion the selection time. We denote selection time by t so that the sample means for the control and the selected treatment are, respectively, inline image and inline image. The sample treatment difference may be expressed as

  • display math

If we ignored treatment selection and comparison of the control to more than one experimental treatment, we would have that inline image, where inline image, and the 100(1 − α)% confidence interval would be given by

  • display math(1)

where inline image is the observed value of inline image and inline image is the standard normal quantile such that inline image. We will refer to the confidence interval obtained using expression (1) as the naive confidence interval. The properties of this confidence interval are likely to be undesirable. For example, the probability that the lower limit of the naive confidence interval is above inline image is greater than inline image because the confidence interval does not account for the fact that treatment S is selected precisely because it appears to be the most effective. We will compare the coverage properties of the naive confidence interval with confidence intervals constructed using methods that account for the treatment selection.
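The behaviour of the naive interval under selection can be illustrated with a small simulation. The following is a minimal sketch, not the authors' simulation study: it assumes the naive interval is the pooled estimate plus or minus a standard normal quantile times the square root of 2σ²/(n1 + n2), that all k + 1 true means are equal, and it uses illustrative values (k = 4, n1 = n2 = 50, σ = 1). The function names are hypothetical.

```python
import random
from statistics import NormalDist


def naive_ci(theta_hat, sigma, n, alpha=0.05):
    """Naive two-sided interval: estimate +/- z * sqrt(2 * sigma^2 / n)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * (2 * sigma ** 2 / n) ** 0.5
    return theta_hat - half_width, theta_hat + half_width


def lower_limit_miss_rate(k=4, n1=50, n2=50, sigma=1.0, alpha=0.05,
                          n_sims=20000, seed=1):
    """Estimate P(naive lower limit > true difference) when all means are equal."""
    rng = random.Random(seed)
    n = n1 + n2
    t = n1 / n                       # selection time
    se1 = sigma / n1 ** 0.5          # standard error of a stage 1 sample mean
    se2 = sigma / n2 ** 0.5          # standard error of a stage 2 sample mean
    misses = 0
    for _ in range(n_sims):
        control1 = rng.gauss(0.0, se1)
        # The arm with the largest stage 1 sample mean is selected.
        best1 = max(rng.gauss(0.0, se1) for _ in range(k))
        control2 = rng.gauss(0.0, se2)
        best2 = rng.gauss(0.0, se2)
        theta_hat = t * (best1 - control1) + (1 - t) * (best2 - control2)
        lower, _upper = naive_ci(theta_hat, sigma, n, alpha)
        if lower > 0.0:              # the true treatment difference is 0
            misses += 1
    return misses / n_sims
```

Calling `lower_limit_miss_rate()` returns the simulated proportion of trials in which the naive lower limit lies above the true difference; under these assumptions it should exceed the nominal α/2.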

The various confidence intervals for inline image that we describe in this paper are constructed using the duality between hypothesis testing and confidence intervals. Whitehead (1997), for example, has described how this may be done. We first describe this for the simple case where a single parameter θ is of interest. Suppose an ordering of the possible data sets is defined so that it is possible to recognise when one data set inline image provides stronger evidence in favour of a large θ than another data set inline image and we write inline image when inline image provides equivalent or stronger evidence when compared with inline image. We define a p-value function of the unknown parameter by

  • display math

where inline image is a possible data set and inline image is the observed data from the trial. If the ordering is such that if inline image, inline image and inline image is a continuous function of θ, then the ordering is referred to as a stochastic ordering. For a stochastic ordering, Whitehead (1997) describes that, if inline image is such that for data inline image, inline image, then

  • display math

The various methods of constructing confidence intervals described in this paper use two techniques based on this property. The first technique is based on the fact that for inline image and inline image calculated from the data inline image such that inline image and inline image, respectively, inline image forms a inline image confidence interval for θ. The second technique is based on noting that the inline image confidence interval for θ is the set inline image. Like Stallard and Todd (2005), we will refer to this technique of obtaining a confidence interval as the p-value inversion method.
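The first technique can be sketched numerically for the simplest case of a single normal observation. This is an illustration only: it assumes the p-value function is the upper-tail probability P_θ(X ≥ x) for X ~ N(θ, se²), and that the limits are the solutions of p(θ) = α/2 and p(θ) = 1 − α/2. The helper names and the bisection bracket are hypothetical choices.

```python
from statistics import NormalDist


def p_value(theta, x_obs, se):
    """P_theta(X >= x_obs) for X ~ N(theta, se^2); increasing in theta."""
    return 1.0 - NormalDist(mu=theta, sigma=se).cdf(x_obs)


def solve_increasing(func, target, lo, hi, tol=1e-10):
    """Bisection for func(theta) = target, with func increasing on [lo, hi]."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if func(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def ci_by_inversion(x_obs, se, alpha=0.05):
    """Confidence limits obtained by solving p(theta) = alpha/2 and 1 - alpha/2."""
    bracket = (x_obs - 10 * se, x_obs + 10 * se)

    def p(theta):
        return p_value(theta, x_obs, se)

    lower = solve_increasing(p, alpha / 2, *bracket)
    upper = solve_increasing(p, 1 - alpha / 2, *bracket)
    return lower, upper
```

In this unadjusted normal case the result agrees with the familiar estimate plus or minus quantile times standard error; the value of the root-finding formulation is that the same code applies when `p_value` is replaced by a p-value function that accounts for selection.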

The ordering may be defined using the sufficient statistics. If the distribution function of the sufficient statistic can be obtained, it is easier to use the first technique. Turning to our setting of the treatment selection case, suppose the distribution of inline image that accounts for treatment selection is known and its distribution function is denoted by inline image. Let inline image denote the observed value for inline image, and suppose inline image and inline image are obtained by, respectively, solving inline image and inline image. Following the first technique, the inline image confidence interval for inline image is given by inline image. As mentioned in Section 'Introduction', Sampson and Sill (2005) and Wu et al. (2010) directly construct the confidence interval for inline image. They use the distribution of inline image (or a function of inline image) to obtain the confidence interval as described above. More details are given in Sections 'Sampson and Sill (2005)' and 'Wu et al. (2010)' below.

The p-value inversion method is attractive when adjusted p-values, such as in the case of testing multiple hypotheses, are used. Stallard and Todd (2005) extend the p-value inversion method to the case where several experimental treatments are tested and the trial includes at least one interim analysis with the most effective treatment selected at the first interim analysis and the trial can stop early for efficacy or futility of the most effective treatment at any of the interim analyses. They produce a confidence region for the parameter vector of the treatment differences from which they obtain the confidence interval for inline image. Based on our setting, we describe how they do this in Section 'Stallard and Todd (2005)'.

As we see in Section 'Posch et al. (2005)', Posch et al. (2005) use the p-value inversion method but unlike Stallard and Todd (2005) who only define a p-value for testing the effectiveness of the selected experimental treatment, they define p-values that test the effectiveness of all the experimental treatments. For data inline image and for inline image in the parameter space for θ, the p-value inline image can be viewed as a p-value for testing the hypothesis

  • display math(2)

In order to produce a confidence region from which to extract the simultaneous confidence intervals, in Section 'Posch et al. (2005)', we describe how Posch et al. (2005) generalise this idea to the case of a parameter vector.

2.2 Sampson and Sill (2005)

To develop a method for calculating confidence intervals after a phase II/III clinical trial, Sampson and Sill (2005) use the first technique described in Section 'Setting and notation'. They derive the density of inline image, where n and inline image are as defined above. Let inline image be the order statistics of stage 1 sample means so that inline image. We denote the ordering by Q and we will use inline image and inline image to, respectively, denote the density and distribution functions conditional on Q. The density of W is obtained conditional on Q and the sufficient statistic. Letting inline image, Sampson and Sill (2005) note that inline image is sufficient for the parameters inline image. Note that Z0 is sufficient for μ0.

Sampson and Sill (2005) write and reparameterise the density inline image to give a form that has parameters inline image, inline image and inline image. Part of the exponent in inline image is inline image, where inline image. The authors describe how uniformly most powerful unbiased tests of hypotheses concerning inline image based on the conditional distribution of W given inline image can be constructed. They show that the density inline image is given by

  • display math(3)

where inline image is a normalisation constant, which is obtained by integrating over W on the real line. Sampson and Sill then note that a inline image confidence interval, inline image for inline image, can be constructed from inline image and inline image because inline image has a monotone likelihood ratio in T.

2.3 Wu et al. (2010)

Wu et al. (2010) also use the first technique from Section 'Setting and notation', but not explicitly, to propose a method for calculating confidence intervals after a phase II/III clinical trial. To obtain the lower and upper bounds, they use the critical values for testing the null hypothesis inline image and so we first describe how, using the distribution of inline image, they obtain the critical values. Wu et al. (2010) assume the test statistic inline image is used to test inline image so that if d and inline image are critical values chosen such that the required type I error rate is not inflated, the null hypothesis inline image is rejected if

  • display math

For a symmetric level α test, to obtain d while controlling the type I error rate in the strong sense, we require

  • display math

Wu et al. (2010) note that Bischoff and Miller (2005) showed that the supremum is attained at inline image. Without loss of generality, inline image so that inline image. Therefore, if we let inline image and let inline image denote the configuration (0, 0, …, 0), then d is obtained under inline image. Using a similar argument, without loss of generality, the least favourable configuration for obtaining inline image is inline image, inline image and inline image. Thus, if we denote the configuration inline image by inline image, d and inline image are obtained such that they, respectively, satisfy

  • display math

Wu et al. (2010) show that replacing inline image in expression (1) with d for the lower bound and inline image for the upper bound produces a confidence interval with a coverage that is at least inline image. If we take

  • display math(4)

Wu et al. show that for any given constants α, d and inline image, and if inline image and inline image are as defined above,

  • display math

for all μ so that coverage of the interval inline image is at least inline image. Note that under the least favourable configuration (inline image) that is used to obtain the upper confidence interval bound, the most effective treatment is selected with probability 1. Therefore, for the case where the apparently most effective treatment continues to stage 2, we can ignore the fact that we are comparing several experimental treatments to control so that we take inline image and

  • display math

which is the usual upper bound of a normal density interval defined in expression (1).

The distribution of inline image and hence values of d and inline image can be obtained following Stallard and Todd (2003). The values of d are greater than inline image so that the confidence interval following Wu et al. (2010) will be wider than the naive confidence interval given by expression (1).
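A rough Monte Carlo approximation of d can serve as a check on values computed from the exact distribution. The sketch below is not the Stallard and Todd (2003) calculation. It assumes that d is the upper α/2 point of the standardised final statistic when all means are equal, that the arm with the largest stage 1 mean is selected, and that group sizes are equal within each stage. The function name and default values are hypothetical.

```python
import random


def critical_value_d(k=4, n1=50, n2=50, alpha=0.05, n_sims=40000, seed=2):
    """Monte Carlo upper alpha/2 point of the standardised statistic under equal means."""
    rng = random.Random(seed)
    t = n1 / (n1 + n2)               # selection time
    stats = []
    for _ in range(n_sims):
        # Standardised stage 1 difference for the selected arm: the k arms share
        # one control, so z1 = (max_i E_i - E_0) / sqrt(2) with E_i iid N(0, 1).
        e0 = rng.gauss(0.0, 1.0)
        z1 = (max(rng.gauss(0.0, 1.0) for _ in range(k)) - e0) / 2 ** 0.5
        # Standardised stage 2 difference (one experimental arm versus control).
        z2 = rng.gauss(0.0, 1.0)
        stats.append(t ** 0.5 * z1 + (1 - t) ** 0.5 * z2)
    stats.sort()
    return stats[int((1 - alpha / 2) * n_sims)]
```

With k = 1 there is no selection and the simulated value should be close to the usual standard normal quantile; with k > 1 it is larger, which is consistent with the wider interval described above.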

2.4 Stallard and Todd (2005)

Stallard and Todd (2005) develop a method for calculating confidence intervals based upon the p-value inversion method, the second of the techniques described in Section 'Setting and notation'. They begin by establishing a confidence region and then suggest several ways in which the region can be reduced to a confidence interval for the selected treatment. Define inline image to be the probability of observing a more extreme set of data than that actually observed in the trial, given inline image and inline image. Let inline image be the maximum-likelihood estimate of the treatment advantage for a set of data X. In the case that stage 2 data are always observed, more extreme evidence is defined to arise in the case where a larger estimate of inline image is observed at the final stage. We will write inline image if inline image. So,

  • display math

We define a p-value function by

  • display math

where inline image is the indicator function taking the value 1 if inline image and 0 otherwise. In their paper, Stallard and Todd show that inline image for any value of inline image, so that a confidence region for inline image with coverage α is given by

  • display math

In order to obtain a confidence interval for inline image, Stallard and Todd give two possible approaches. They assume either that inline image or inline image for inline image. This gives an approximate p-value function depending on inline image alone and hence an approximate confidence interval.

2.5 Posch et al. (2005)

Like Stallard and Todd (2005), Posch et al. (2005) use the p-value inversion method. They describe how to obtain the confidence region for the parameter vector and how to obtain confidence intervals from the confidence region. Before we describe how they define the confidence region, we first need to describe how they propose to conduct hypothesis testing after a trial that uses an ASD as this motivates how they obtain the confidence region. The primary null hypotheses of interest are inline image. We denote the set inline image by inline image. For inline image, we will write inline image. For example for inline image, we will simply write inline image for the intersection null hypothesis inline image. Posch et al. (2005) assume that testing will be conducted as described by Hommel (2001) and Bretz et al. (2006) among others, which involves constructing the closure set consisting of all hypotheses inline image, inline image. For a level α test, a primary hypothesis of interest inline image is rejected if and only if all hypotheses inline image inline image with inline image are rejected, each hypothesis tested at level α. For example, with inline image, H1 is tested using the hypotheses H123, H12, H13 and H1. This controls the type I familywise error rate in the strong sense by the closure principle (CP) of Marcus et al. (1976). Using the CP, testing is conducted in a stepwise manner starting with the global intersection hypothesis. For example, for inline image, the global hypothesis H123 is tested first. If H123 is not rejected, the testing stops and all the null hypotheses of interest (H1, H2, H3) are not rejected. If H123 is rejected, hypotheses inline image inline image with inline image are tested next. Hypothesis testing proceeds to H1 if both H12 and H13 are rejected. Similarly, testing proceeds to H2 if both H12 and H23 and to H3 if both H23 and H13 are rejected. 
We next describe how the p-values for testing the intersection hypotheses in the closure set based on both stages 1 and 2 data are evaluated.
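The closure principle described above can be sketched in a few lines. This is an illustrative implementation only: it assumes a p-value is already available for every intersection hypothesis, stored in a dictionary keyed by sorted tuples of treatment indices, and the function names and the example level are hypothetical.

```python
from itertools import combinations


def closure_set(k):
    """All non-empty subsets of {1, ..., k}, as sorted tuples."""
    arms = range(1, k + 1)
    return [subset for size in range(1, k + 1)
            for subset in combinations(arms, size)]


def closed_test(p_values, k, alpha=0.025):
    """Reject H_i if and only if every intersection hypothesis containing i has p <= alpha."""
    rejected = {}
    for i in range(1, k + 1):
        needed = [subset for subset in closure_set(k) if i in subset]
        rejected[i] = all(p_values[subset] <= alpha for subset in needed)
    return rejected
```

For k = 3, the hypotheses needed for H1 are those indexed by (1,), (1, 2), (1, 3) and (1, 2, 3), matching the example of H123, H12, H13 and H1 given in the text.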

For all hypotheses in the closure set, stage 1 p-values are obtained using stage 1 data. Similarly, stage 2 p-values are obtained for all hypotheses in the closure set using the stage 2 data. Because, in this paper, we consider the case in which only one experimental treatment, treatment S, and control continue to stage 2, hypothesis inline image is tested using the p-value for hypothesis inline image in stage 2 where inline image. If inline image, p-values for such hypotheses are set equal to 1 (Posch et al., 2005).

Posch et al. (2005) describe how to obtain one-sided confidence intervals (lower bounds) that correspond to the one-sided hypothesis tests for superiority of the experimental treatments over the control. Therefore, for each intersection hypothesis inline image, the alternative hypothesis is that the difference between the effect of the most effective experimental treatment in set I and the control treatment is greater than 0. We denote the one-sided stage 1 and stage 2 p-values used to test superiority of the apparently most effective treatment over the control treatment in the null hypothesis inline image by inline image and inline image, respectively. When data are available, the p-values for elementary hypotheses inline image (inline image) are calculated using the usual pairwise tests such as a t-test or a chi-squared test. For the intersection hypothesis inline image with inline image, we will use the Šidak adjusted p-values. The stage 1 Šidak adjusted p-value for hypothesis inline image is given by

  • display math

We will also use the Dunnett adjusted p-values. For the Dunnett test, inline image is given by

  • display math

where ϕ and Φ, respectively, denote the standard normal density and distribution functions and inline image is the maximum standardised difference corresponding to the treatments in intersection hypothesis inline image. At stage 2, because only one experimental treatment is tested, p-value adjustment is not required and so inline image if inline image and inline image otherwise.
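The two stage 1 adjustments can be sketched as follows. The Šidak function uses the standard form 1 − (1 − min p)^m for an intersection of m hypotheses. The Dunnett function assumes equal group sizes, so that the pairwise statistics share the control and have correlation 1/2, and evaluates the resulting one-dimensional integral with a simple trapezoid rule. Function names, the grid size and the integration limits are hypothetical choices, not taken from the paper.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def sidak_p(pairwise_p):
    """Sidak adjusted p-value for the intersection of len(pairwise_p) hypotheses."""
    m = len(pairwise_p)
    return 1.0 - (1.0 - min(pairwise_p)) ** m


def dunnett_p(z_max, m, grid=4000, limit=8.0):
    """One-sided Dunnett p-value: 1 - P(max of m standardised differences <= z_max).

    Uses P(max <= z) = integral of Phi(sqrt(2) * z + x)^m * phi(x) dx,
    which holds for equal group sizes (correlation 1/2 between comparisons).
    """
    step = 2 * limit / grid
    total = 0.0
    for j in range(grid + 1):
        x = -limit + j * step
        weight = 0.5 if j in (0, grid) else 1.0   # trapezoid rule end points
        total += (weight * STD_NORMAL.cdf(sqrt(2) * z_max + x) ** m
                  * STD_NORMAL.pdf(x))
    return 1.0 - total * step
```

With m = 1 the Dunnett p-value reduces to the unadjusted one-sided p-value, and for m > 1 it is smaller than the Šidak value computed from the same pairwise p-values because it exploits the positive correlation induced by the shared control.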

To test hypothesis inline image using data from stages 1 and 2, a combined p-value inline image is obtained using a combination test such as Fisher's combination test (Fisher, 1932; Bauer and Köhne, 1994) or the inverse normal method (Lehmacher and Wassmer, 1999). The inverse normal method involves converting the p-values into standard normal variates and weighting them so that

  • display math(5)

where inline image are weights chosen in advance of the trial subject to inline image. Usually the weights are chosen with squares proportional to the stage 1 and stage 2 sample sizes. In Sections 'A comparison of the confidence intervals using a simulation study' and 'Worked example', we will set inline image and inline image. We will refer to this testing procedure that uses the CP and combination of p-values as the two-stage closed testing procedure using the Šidak or Dunnett correction.
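The inverse normal combination can be sketched as below. The sketch uses the standard form of the method, 1 − Φ(w1 Φ⁻¹(1 − p1) + w2 Φ⁻¹(1 − p2)), with weights whose squares are proportional to the stage sample sizes; this weight choice is an assumption for illustration, and the function name is hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def inverse_normal_p(p1, p2, n1, n2):
    """Combine stage-wise one-sided p-values with the inverse normal method."""
    w1 = (n1 / (n1 + n2)) ** 0.5     # w1^2 + w2^2 = 1
    w2 = (n2 / (n1 + n2)) ** 0.5
    if p1 >= 1.0 or p2 >= 1.0:
        return 1.0                   # a stage-wise p-value of 1 gives combined p = 1
    if p1 <= 0.0 or p2 <= 0.0:
        return 0.0
    # Phi^{-1}(1 - p) = -Phi^{-1}(p); working with p directly avoids rounding 1 - p to 1.
    z = -(w1 * STD_NORMAL.inv_cdf(p1) + w2 * STD_NORMAL.inv_cdf(p2))
    return STD_NORMAL.cdf(-z)
```

For example, two stage-wise p-values of 0.05 with equal stage sizes combine to approximately 0.010. Returning 1 when either p-value equals 1 matches the convention above of setting p-values to 1 for hypotheses that cannot be tested at stage 2.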

Having described the testing procedure, we are in a position to describe how to construct the confidence intervals using the p-value inversion method. Similar to hypothesis statement (2), we can define a hypothesis statement for a parameter vector. For the parameter vector inline image define

  • display math(6)

and let the corresponding global intersection hypothesis inline image be denoted by inline image. For data inline image, let inline image and inline image, respectively, denote stage 1 and stage 2 p-values for testing inline image against the alternative that at least one of the differences between the experimental treatments and the control treatment inline image is greater than inline image. Note that following the testing procedure described above, none of the elementary hypotheses in hypothesis statement (6) will be rejected if the global intersection hypothesis inline image is not rejected. Thus, for data inline image, a inline image confidence region for inline image is given by all vectors inline image such that inline image.

Posch et al. (2005) note that, in general, a confidence region will not be the cross product of confidence intervals for inline image, inline image. In order to obtain confidence intervals, they propose embedding the confidence region defined above in a hyper-rectangle. They are interested in one-sided confidence intervals and to obtain the lower confidence bounds, they note that the confidence region is embedded in the rectangle by setting adjusted p-values

  • display math

for all inline image. Note that for the pairwise hypotheses (6), the supremum over the real line is 1. Posch et al. (2005) therefore observe that inline image takes a simple form for tests that use the pairwise p-values to obtain p-values for intersection hypotheses. For example, for the Šidak test,

  • display math(7)

For stage 2, because in our case only one experimental treatment continues to stage 2, we define

  • display math

No p-value adjustment is required for analysis using stage 2 data only so that for inline image, inline image and for inline image, inline image. Hence, based on the adjusted p-values, for data inline image, the simultaneous inline image one-sided confidence interval for inline image, inline image is defined by

  • display math(8)

2.6 Extending Posch et al. (2005) method to obtain two-sided confidence intervals

In this paper, we are interested in simultaneous two-sided confidence intervals, and in this section we describe how to extend the Posch et al. (2005) method to this setting. The two-stage closed testing procedure is used so that p-values are obtained for each intersection hypothesis inline image (inline image). Two-sided confidence intervals correspond to two-sided hypothesis tests so that for each null intersection hypothesis in the closure set described in Section 'Posch et al. (2005)', the alternative hypothesis is that the difference between the effect of the apparently most effective experimental treatment in set I and the control treatment is not equal to 0. We define p-values for testing superiority and inferiority of the experimental treatments. As in Section 'Posch et al. (2005)', we denote the one-sided stage 1 and stage 2 p-values used to test superiority of the most effective treatment over the control treatment in the null hypothesis inline image by inline image and inline image, respectively. We have described how to obtain these p-values using the Šidak and the Dunnett tests. For testing in the opposite direction, we denote by inline image and inline image, respectively, the stage 1 and stage 2 one-sided p-values that test inferiority of the apparently most effective experimental treatment to the control treatment in the null hypothesis inline image. After collecting stage 2 data, because only one experimental treatment is tested, using the same explanation as in Section 'Posch et al. (2005)', inline image if inline image and inline image otherwise. The stage 1 Šidak adjusted p-value for hypothesis inline image is given by

  • display math

where inline image is the pairwise p-value testing the comparison of the experimental treatment i to the control in favour of the alternative that inline image.

To test hypothesis inline image using data from stages 1 and 2, combined p-values inline image and inline image are obtained. Using the inverse normal method, inline image is given by expression (5) while the expression for inline image is given by

  • display math

where inline image and Φ are as defined in Section 'Posch et al. (2005)'.
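As an illustration of the two building blocks used above, the following is a minimal sketch of a Šidak adjustment and of the inverse normal combination of two stage-wise one-sided p-values. It assumes the standard textbook forms of both (the exact expressions and weights of this paper are not reproduced here); the function names and the choice of passing the squared stage 1 weight are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def sidak_adjusted(pairwise_p):
    """Sidak-adjusted p-value for an intersection of len(pairwise_p) hypotheses.

    Assumes the standard form 1 - (1 - min p)^m, which treats the pairwise
    p-values as independent.
    """
    m = len(pairwise_p)
    return 1.0 - (1.0 - min(pairwise_p)) ** m


def inverse_normal_combination(p1, p2, w1_squared):
    """Combine one-sided stage 1 and stage 2 p-values.

    Assumes pre-specified weights with w1^2 + w2^2 = 1 and 0 < w1^2 < 1
    (for example w1^2 = selection time t).
    """
    # A stage-wise p-value of 1 (no data for that hypothesis) gives a
    # combined p-value of 1, so the hypothesis can never be rejected.
    if p1 >= 1.0 or p2 >= 1.0:
        return 1.0
    if p1 <= 0.0 or p2 <= 0.0:
        return 0.0
    w1 = w1_squared ** 0.5
    w2 = (1.0 - w1_squared) ** 0.5
    z = w1 * STD_NORMAL.inv_cdf(1.0 - p1) + w2 * STD_NORMAL.inv_cdf(1.0 - p2)
    return 1.0 - STD_NORMAL.cdf(z)


if __name__ == "__main__":
    q1 = sidak_adjusted([0.01, 0.20, 0.50])
    print(q1, inverse_normal_combination(q1, 0.03, w1_squared=0.2))
```

The guard for a stage-wise p-value of 1 mirrors the convention in the text that hypotheses without stage 2 data cannot be rejected.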

Having described the two-sided hypothesis closed testing procedure, we are in a position to describe how to construct the two-sided confidence intervals. For the parameter vector inline image define

  • display math(9)

and as in Section 'Posch et al. (2005)', let the corresponding global intersection hypothesis inline image be denoted by inline image. For data inline image, let inline image and inline image, respectively, denote stage 1 and stage 2 p-values for testing inline image against the alternative that at least one of the differences between the experimental treatments and the control treatment inline image is greater than inline image. Similar p-values inline image and inline image are defined for a test in the opposite direction. Note that following the closed testing procedure described above, none of the elementary hypotheses in hypothesis statement (9) will be rejected if the global intersection hypothesis inline image is not rejected. Thus, for data inline image, a inline image confidence region for inline image is given by all vectors inline image such that inline image and inline image.

As in the case of one-sided confidence intervals, in order to obtain two-sided confidence intervals, the confidence region is embedded in a hyper-rectangle. This is done by defining adjusted p-values. The stages 1 and 2 adjusted p-values inline image and inline image, respectively, for testing the superiority of the apparently most effective treatment in hypothesis inline image are described in Section 'Posch et al. (2005)', with expression (7) giving an example of an expression for inline image. For tests of inferiority, in our case, because only one experimental treatment is selected, the adjusted p-value inline image for inline image and inline image for inline image. Following Posch et al. (2005), for the Šidak test, we would define inline image. However, as described in Section 'Wu et al. (2010)', for the case where the apparently most effective treatment is selected to continue to stage 2, we can ignore the fact that we are comparing several experimental treatments to the control treatment so that the adjusted p-value simplifies to

  • display math(10)

for all inline image. This leads to less conservative confidence intervals. We emphasize that inline image is obtained using expression (10) only for the case where the apparently most effective treatment is selected. In general, when some other selection rule is used, inline image should be obtained using the expression described previously. Similarly, for all inline image we set

  • display math

Hence, for data inline image, the simultaneous inline image two-sided confidence interval for inline image, inline image is defined by

  • display math(11)

3 A comparison of the confidence intervals using a simulation study

3.1 Simulation study settings

In this section, we use a simulation study to assess the properties of confidence intervals constructed using the methods described above. From the density of W, which is given by expression (3), we note that a confidence interval depends on the number of experimental treatments, the selection time t and the true parameter values for the treatment differences. Therefore, we consider several scenarios varying these aspects. We consider scenarios where the number of experimental treatments in stage 1 is between 2 and 4, and different true parameter values for the treatment means. In all simulations we take σ² to be 1 and consider small differences in treatment means, as might be anticipated in most clinical trials. In all simulations, we assume 400 patients are allocated to the control and another 400 patients are allocated to the selected experimental treatment.

We consider making selection at time points 0.2, 0.4, 0.6 and 0.8. For example, if selection is made at time point 0.2, then at stage 1, 80 (that is, 0.2 × 400) patients are allocated to the control and to each experimental treatment, and at stage 2, 320 (that is, 400 − 80) patients are allocated to the control and to the selected experimental treatment. For each time point and each scenario, we conduct 10,000 simulation runs. In each simulation run, we simulate stage 1 data, select the experimental treatment with the highest simulated sample mean and, based on this, simulate stage 2 data. Using the simulated data, for each of the six methods of constructing confidence intervals (naive, Sampson and Sill, Wu et al., Posch et al. with Šidak correction, Posch et al. with Dunnett correction, and Stallard and Todd assuming the values of the treatment effects for the dropped treatments are the naive maximum-likelihood estimates), we evaluate the confidence interval for the treatment difference between the selected treatment and the control and determine whether the confidence interval is above, contains or is below inline image. We compute Stallard and Todd confidence intervals under this assumption because, as we see in Section 'Worked example', it leads to confidence intervals with better properties than assuming the treatment effects of the dropped treatments are equal to zero. In the simulation results, we report the simulated probability that a confidence interval is above inline image and the simulated probability that a confidence interval contains inline image, using the frequencies of these occurrences over the 10,000 simulation runs. We refer to the simulated probabilities that the confidence intervals contain inline image as the coverage probabilities. In all simulations, the target is two-sided 95% confidence intervals, so we compare the simulated coverage probabilities to 0.95 and the simulated probabilities that a confidence interval is above inline image to 0.025. Except for the computations for Stallard and Todd confidence intervals, which were performed using a purpose-written C program, the computations in this section and Section 'Worked example' were performed in the R statistical package (R Development Core Team, 2010). All the programs used for the computations and for drawing the figures are available at https://files.warwick.ac.uk/nstallard/browse/adaptive. Except for Fig. 4, which was drawn in S-Plus, all figures were drawn in R.
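The simulation loop described above can be sketched for the naive interval alone. This is a minimal illustration in Python rather than the authors' R code: it simulates arm-level sample means directly instead of patient-level data, takes the control mean to be 0 so that `thetas` are the true treatment differences, and assumes that "above" means the lower limit exceeds the true difference for the selected treatment. The function name and arguments are hypothetical.

```python
import math
import random
from statistics import NormalDist


def simulate_naive_ci(thetas, t, n_per_arm=400, sigma=1.0,
                      runs=10_000, level=0.95, seed=1):
    """Return (P(CI above true difference), coverage) for the naive CI."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1.0 - (1.0 - level) / 2.0)
    n1 = round(t * n_per_arm)          # stage 1 patients per arm
    n2 = n_per_arm - n1                # stage 2 patients per continuing arm
    sd1 = sigma / math.sqrt(n1)        # sd of a stage 1 sample mean
    sd2 = sigma / math.sqrt(n2)        # sd of a stage 2 sample mean
    se = sigma * math.sqrt(2.0 / n_per_arm)
    above = contains = 0
    for _ in range(runs):
        control1 = rng.gauss(0.0, sd1)
        stage1 = [rng.gauss(theta, sd1) for theta in thetas]
        # Select the arm with the highest stage 1 sample mean.
        s = max(range(len(thetas)), key=stage1.__getitem__)
        control2 = rng.gauss(0.0, sd2)
        selected2 = rng.gauss(thetas[s], sd2)
        # Naive estimate: pool both stages as if no selection had occurred.
        estimate = (n1 * (stage1[s] - control1)
                    + n2 * (selected2 - control2)) / n_per_arm
        lower, upper = estimate - z * se, estimate + z * se
        if lower > thetas[s]:
            above += 1
        elif upper >= thetas[s]:
            contains += 1
    return above / runs, contains / runs


if __name__ == "__main__":
    for t in (0.2, 0.4, 0.6, 0.8):
        print(t, simulate_naive_ci([0.0, 0.0], t))
```

The other five methods would replace only the interval construction inside the loop; the data generation and selection steps stay the same.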

3.2 Simulation results

For simplicity, when reporting the simulation results, by ‘confidence interval’ we mean the confidence interval for inline image. Figure 1 shows plots of selection time against the simulated probabilities that the confidence intervals are above inline image (lower subplots) and the simulated coverage probabilities (upper subplots) when two experimental treatments are tested in stage 1. The values of θ1 and θ2 are reported above each plot. The confidence interval that each line type corresponds to is given in the caption. Focussing on the naive confidence intervals (continuous lines), we observe that the coverage probabilities are approximately the desired 0.95 for all scenarios considered in Fig. 1. For the case when inline image (the top left panel), the probability that the naive confidence interval is above inline image is larger than the desired 0.025, and as the selection time increases, this probability increases. However, at each selection time, as θ2 increases and hence becomes distinct from θ1, the probability that the confidence interval is above inline image decreases. This is similar to the case of point estimation, where the bias of the point estimate of inline image decreases as one experimental treatment becomes more clearly superior to the other experimental treatments (Bauer et al., 2010).

Figure 1. Plots showing selection time against probability that confidence interval is above inline image (circles, lower subplots) and coverage probabilities (triangles, upper subplots). Continuous lines (—) correspond to naive confidence intervals (CIs), short dashed lines (- - -) to Sampson and Sill CIs, long dashed lines (– – –) to Wu et al. CIs, dotted lines (· · · · ) to Posch et al. (with Šidak correction) CIs, long and short dashed lines (– - – - –) to Posch et al. (with Dunnett correction) CIs and dashed and dotted lines (- · - · -) to Stallard and Todd CIs. The Wu et al. and Posch et al. (with Dunnett correction) confidence intervals have very similar properties so that their lines almost coincide leading to the appearance of a single line that has very long dashes that are of unequal length. In all plots inline image while θ2 is different in each plot.

For the Wu et al. confidence intervals (long dashed lines), when inline image, for all selection times, the probability that the confidence interval is above inline image is approximately 0.025. However, for each selection time point, as θ2 increases and hence becomes distinct from θ1, the probability that the confidence interval is above inline image decreases. This may be explained by the fact that, as explained in Section 'Wu et al. (2010)', the confidence bounds are obtained assuming all the treatment differences are equal, so that under this configuration the probability that the confidence interval is above inline image is approximately 0.025. However, the Wu et al. bounds lead to conservative confidence intervals when the experimental treatments are not equally effective. The coverage probabilities in all plots are above the desired 0.95 and increase with selection time, but the increase is smaller as the value of θ2 becomes more distinct from θ1. The increase arises because, as we described in Section 'Wu et al. (2010)', we use the naive confidence interval upper bounds in the Wu et al. confidence intervals. When experimental treatments are equally effective and selection is made later in the trial, the shift in the density of the selected treatment is larger than when selection is made earlier in the trial, and so a naive confidence interval upper bound leads to more conservative confidence intervals. However, if the most effective experimental treatment is distinct, selection later in the trial leads to less bias in the upper bound of the confidence interval when the most effective treatment is selected to continue to stage 2, and hence to a smaller increase in coverage probability for such scenarios.

For the Posch et al. with Šidak correction confidence intervals (dotted lines), when inline image, for all selection time points, the probability that the confidence interval is above inline image is approximately 0.023. However, for each selection time point, as θ2 increases and hence becomes distinct from θ1, the probability that the confidence interval is above inline image decreases. This can be explained by how the adjusted p-values were defined in Section 'Posch et al. (2005)' to correspond to the least favourable configuration. The coverage probabilities in all plots are above 0.95 and increase with selection time. Thus the characteristics of the Posch et al. with Šidak correction confidence intervals are similar to the Wu et al. confidence intervals. The only difference is that the Posch et al. with Šidak correction confidence intervals are more conservative. One of the reasons for the conservativeness of the Posch et al. with Šidak correction confidence intervals compared to the Wu et al. confidence intervals is the use of a common control arm in stage 1. While determining the boundaries d in Eq. (4), we accounted for the correlation due to using the same control arm whereas for the Posch et al. method (Šidak correction), we assume the stage 1 pairwise p-values are independent, which is not the case because a common control arm is used in stage 1. For the Posch et al. with Šidak correction confidence intervals, as in the Wu et al. confidence intervals, we use the naive confidence interval upper bounds so that the differences for the probabilities that a confidence interval is above inline image are equal to the differences for the coverage probabilities.

The Posch et al. with Dunnett correction confidence intervals (long and short dashed lines) almost coincide with the Wu et al. confidence intervals so that they have similar properties. This is because for the Wu et al. confidence intervals, the critical values are obtained based on the distribution of the maximum treatment difference whilst accounting for the common control and similarly for the Posch et al. (with Dunnett correction) confidence intervals, the stage 1 Dunnett adjusted p-value is based on the maximum difference and accounts for the common control. The slight difference is because, for the Posch et al. confidence intervals, we transform the p-values to normal variates in order to combine evidence from stages 1 and 2 while for the Wu et al. confidence intervals, we use the sufficient statistics.

For the Sampson and Sill confidence intervals (short dashed lines), the coverage probabilities are approximately 0.95 for all selection times and all scenarios considered in Fig. 1. Also, the probabilities that the confidence interval is above inline image are close to 0.025 for all scenarios so that the Sampson and Sill confidence intervals are approximately symmetric.

Like the naive confidence intervals and the Sampson and Sill confidence intervals, the Stallard and Todd confidence intervals (dashed and dotted lines) have coverage probabilities approximately equal to 0.95 for all selection times and for all scenarios considered in Fig. 1. However, the probabilities that the confidence intervals are above inline image are higher for the Stallard and Todd confidence intervals compared to the Sampson and Sill confidence intervals, in some cases slightly higher than the target value of 0.025.

Figure 2 shows simulation results when, at stage 1, two experimental treatments are tested (Column 1), three experimental treatments are tested (Column 2) and four experimental treatments are tested (Column 3). The plots in Column 1 are the same as in the first column in Fig. 1, but on a different scale. We have used different scales because of the large differences between the simulated probabilities in Figs. 1 and 2. The top row corresponds to the scenario where all experimental treatments are equally effective while the bottom row corresponds to the scenario where the treatment effect of the most effective treatment is 0.15 while the treatment effects of the other treatments are 0. For the naive confidence intervals, we observe that, as the number of experimental treatments tested in stage 1 increases, the probability that the naive confidence interval is above inline image increases, sometimes exceeding 0.05 (e.g., see the top right panel). Also, the coverage probabilities for the naive confidence interval decrease as the number of experimental treatments tested in stage 1 increases and can be less than the desired 0.95 (see the middle and the right panels). As in Fig. 1, in Fig. 2 the coverage probabilities for the Sampson and Sill confidence intervals and the Stallard and Todd confidence intervals are approximately 0.95. For these two confidence intervals, the probability that the confidence interval is above inline image is not far from the desired 0.025, although it sometimes slightly exceeds 0.025 (the maximum observed in Fig. 2 is 0.027 for the Sampson and Sill confidence intervals and 0.03 for the Stallard and Todd confidence intervals). For the Wu et al. confidence intervals and the Posch et al. (with Dunnett correction) confidence intervals, when all experimental treatments are equally effective (top row), the probability that the confidence interval is above inline image is approximately the desired 0.025. For the top row, the coverage probabilities for these two confidence intervals increase with the number of experimental treatments. This is because they use the naive confidence interval upper limits, which lead to more conservative intervals as the number of experimental treatments increases. When one of the experimental treatments is more effective than the other experimental treatments (bottom row), the Wu et al. confidence intervals and the Posch et al. (with Dunnett correction) confidence intervals are conservative, and they become more conservative as selection is made later in the trial. The Posch et al. (with Šidak correction) confidence intervals have similar properties to the Wu et al. confidence intervals and the Posch et al. (with Dunnett correction) confidence intervals except that they are slightly more conservative.

Figure 2. Probability that confidence interval is above inline image (circles) and coverage probabilities (triangles). Continuous lines (—) correspond to naive confidence intervals (CIs), short dashed lines (- - -) to Sampson and Sill CIs, long dashed lines (– – –) to Wu et al. CIs, dotted lines (· · · · ) to Posch et al. (with Šidak correction) CIs, long and short dashed lines (– - – - –) to Posch et al. (with Dunnett correction) CIs and dashed and dotted lines (- · - · -) to Stallard and Todd CIs. The number of treatments varies from 2 to 4 from left to right and the treatment differences are equal.

3.3 Summary findings from the simulation study

In Section 'Simulation results', we have reported a simulation study used to assess the properties of the various methods for constructing confidence intervals following phase II/III clinical trials. The results are based on a setting where there is no opportunity to stop at stage 1 so that the interim analysis is used entirely for treatment selection. We observed that the naive confidence interval achieves the right coverage (0.95) when two experimental treatments are tested in stage 1, but as the number of experimental treatments tested in stage 1 increases, the coverage becomes more inaccurate with coverage of the naive confidence interval quickly becoming less than 0.95. When the most effective experimental treatment is distinct with a very large margin, the probability that the naive confidence interval is above the true treatment difference is approximately 0.025 as desired. However, when the most effective treatments tie, this probability can rise above 0.05. Thus, we would not recommend using the naive confidence interval.

We have also observed that, as expected, the Posch et al. confidence intervals and the Wu et al. confidence intervals are similar. The only difference is that the Posch et al. confidence intervals can be more conservative than the Wu et al. confidence intervals if the adjusted p-values used to obtain the Posch et al. confidence intervals do not account for the correlation arising from using a common control arm. Both the Posch et al. and the Wu et al. confidence intervals are conservative (coverage probabilities were greater than the desired 0.95), and the conservativeness increases slightly as the number of treatments tested in stage 1 increases. The conservativeness of the Posch et al. and the Wu et al. confidence intervals was not extreme (the maximum coverage probability observed was approximately 0.97). Also, as expected from the way these confidence intervals are constructed, we observed that neither the probability that the confidence interval is above the true treatment difference nor the probability that it is below the true treatment difference exceeds 0.025. Based upon these results, we consider both the Posch et al. and the Wu et al. methods of constructing confidence intervals to perform well.

The Stallard and Todd confidence intervals achieved the desired coverage probability of approximately 0.95 for all scenarios considered in the simulation study. When two or three experimental treatments are tested in stage 1, the probability that the confidence interval is above inline image seems to exceed 0.025 for some scenarios, so that the coverage of the confidence intervals is asymmetric. The probability that the confidence interval is above inline image is higher when the most effective experimental treatments are similar than when the most effective treatment is distinct. Nevertheless, when two or three experimental treatments are tested in stage 1, considering both the coverage probabilities and the probability that the confidence interval is above inline image, we would consider the Stallard and Todd confidence intervals to perform at least as well as the Posch et al. and the Wu et al. confidence intervals. However, for some scenarios (not presented in the paper), when four experimental treatments are tested in stage 1, the Stallard and Todd confidence intervals perform poorly when selection is made after the halfway point of the trial.

The Sampson and Sill method performed the best amongst the methods of constructing confidence intervals we reviewed. In all scenarios considered, the coverage probabilities were approximately 0.95 and the probabilities that the confidence intervals are above inline image are approximately 0.025 although sometimes they are slightly above 0.025.

Based on these findings, when there are no opportunities to stop at stage 1, our recommendation is that the naive confidence interval is not appropriate after an ASD. If the testing proposed by Sampson and Sill is to be used, we recommend using the Sampson and Sill confidence intervals. If the combination of p-values method is to be used to test the hypothesis, we recommend using the Posch et al. confidence intervals. If the group sequential method is to be used to test the hypothesis and two or three experimental treatments are tested in stage 1, we recommend using either the Stallard and Todd confidence intervals or the Wu et al. confidence intervals. If the group sequential method is to be used to test the hypothesis and more than three experimental treatments are tested in stage 1, we recommend using the Wu et al. confidence intervals. This ensures that the conclusions from the hypothesis test and confidence interval approximately coincide.

We have assumed above that a confidence interval is used to complement hypothesis testing results. If, in contrast, the investigators want to use a confidence interval both for interval estimation and hypothesis testing, it is important to consider the probability that 0 is not included in the confidence interval, which is the power of the hypothesis test corresponding to the confidence interval. Figure 3 shows coverage probabilities (top subplots), power (middle subplots) and the probability that the lower bound is greater than inline image (bottom subplots) for several scenarios based on different parameter values. The parameter values are given on top of the plots and on the x-axes, and the line type each confidence interval corresponds to is reported in the caption. In all plots, inline image, inline image and inline image. As in Section 'Simulation results', for all scenarios, the Sampson and Sill confidence interval has the desired coverage probability of 95% and the probability that the lower bound is greater than inline image is approximately 0.025, leading to an approximately symmetric confidence interval. For all scenarios, the Wu et al. and the Posch et al. (with Dunnett correction) confidence intervals have approximately equal power, which is higher than the power of the Posch et al. (with Šidak correction) and the Stallard and Todd confidence intervals. As has been observed by Wu et al. (2010), the Sampson and Sill confidence interval has higher power than the Wu et al. confidence interval for some scenarios where the most effective treatment is distinctly superior to the other experimental treatments, while the Wu et al. confidence interval has higher power than the Sampson and Sill confidence interval when the treatment difference of the most effective treatment is not very different from the treatment differences of the other experimental treatments. Based on the findings in Fig. 3, for the case where a confidence interval is to be used for hypothesis testing and interval estimation, if an investigator desires symmetric confidence intervals that attain the desired coverage probability, the best option is to adequately power the trial for inference using the Sampson and Sill confidence interval. If, in order to perform a hypothesis test with higher power, the investigator can accept a confidence interval that is asymmetric and has higher coverage probability than the targeted coverage probability, the best option is a hybrid strategy in which the lower bound is obtained using the Wu et al. or the Posch et al. methods, because they have the highest power in most scenarios, and the upper bound is obtained using the Sampson and Sill method.

Figure 3. Coverage probability (top subplots), power, that is, the probability that the lower bound is greater than 0 (middle subplots), and the probability that the lower bound is greater than inline image (bottom subplots). The top left panel corresponds to the case where inline image. The other panels are for different scenarios when inline image. Short dashed lines (- - -) correspond to Sampson and Sill CIs, long dashed lines (– – –) to Wu et al. CIs, dotted lines (· · · · ) to Posch et al. (with Šidak correction) CIs, long and short dashed lines (– - – - –) to Posch et al. (with Dunnett correction) CIs and dashed and dotted lines (- · - · -) to Stallard and Todd CIs.

4 Worked example

All the confidence intervals perform well but, as would be expected from the simulation study, the various methods lead to different intervals for a given data set. We therefore demonstrate how to compute each of the confidence intervals. We use an example based on the trial reported by Wilkinson and Murray (2001), which was introduced in Section 'Introduction'. For the purpose of our example, we assume a two-stage trial was undertaken and selection is based on efficacy data only. We assume treatment selection is made after 20 patients have been randomised to each treatment arm and that, at stage 2, an additional 80 patients are randomly allocated to placebo and to the selected dose of galantamine, making a total of 100 patients allocated to each of the control and the selected treatment.

The analysis of the trial data by Wilkinson and Murray compared changes in ADAS-cog scores from baseline at 12 weeks of treatment. The changes were assumed to be normally distributed with a common standard deviation across the four groups. In this example, we calculate the changes from baseline such that a higher score is desired. We assume that the estimates in the trial by Wilkinson and Murray, which for placebo and for 18 mg/day, 24 mg/day and 36 mg/day of galantamine are, respectively, −1.6, 0.1, 1.4 and 0.7, are the true values of the means, and we use them to obtain two-stage phase II/III clinical trial data. Using the observed data in Wilkinson and Murray, we assume the common standard deviation for change from baseline for ADAS-cog is 6.17. Based on these values for the means and standard deviation, suppose that the observed stage 1 means from a phase II/III clinical trial are as given in Column 2 of Table 1. Based on the observed stage 1 data, 36 mg/day of galantamine and placebo would be tested further in stage 2. We suppose the results from stage 2 are as given in Column 3.

Table 1. Example of observed changes in ADAS-cog mean scores from baseline at 12 weeks for a hypothetical two-stage adaptive seamless trial based on the trial by Wilkinson and Murray (2001)

Treatment            Stage 1    Stage 2
Placebo              −1.35      −1.61
18 mg galantamine    0.18
24 mg galantamine    1.18
36 mg galantamine    1.20       0.69

The various 95% confidence intervals for the treatment difference between the 36 mg/day of galantamine and placebo (36 mg galantamine − placebo) are given in Table 2. They are computed as follows. The difference in means between the placebo and 36 mg galantamine is inline image. Following expression (1), the naive confidence interval is inline image. For the Sampson and Sill confidence interval, as described in Section 'Sampson and Sill (2005)', we perform a search using Eq. (3) to obtain the lower and upper bounds. Using a program written in the R statistical package, the Sampson and Sill confidence interval is (0.249, 3.814). As expected, the Sampson and Sill lower bound is smaller than the naive lower bound because the Sampson and Sill confidence interval accounts for treatment selection. The Sampson and Sill upper bound is smaller than the naive upper bound but, compared to the lower bounds, the Sampson and Sill upper bound is more similar to the naive upper bound. This is because the naive upper bound approximately corresponds to the least favourable configuration.
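The naive interval can be checked with a few lines of arithmetic. The sketch below assumes that expression (1) is the usual known-variance normal interval applied to the pooled stage 1 and stage 2 means, with 20 and 80 patients per arm and a common standard deviation of 6.17; the function names are hypothetical. Under these assumptions it reproduces the naive row of Table 2.

```python
import math
from statistics import NormalDist


def pooled_mean(mean1, n1, mean2, n2):
    """Sample-size-weighted mean of the stage 1 and stage 2 means."""
    return (n1 * mean1 + n2 * mean2) / (n1 + n2)


def naive_ci(sel1, sel2, ctl1, ctl2, n1, n2, sigma, level=0.95):
    """Naive two-sided CI that ignores treatment selection."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - level) / 2.0)
    diff = pooled_mean(sel1, n1, sel2, n2) - pooled_mean(ctl1, n1, ctl2, n2)
    se = sigma * math.sqrt(2.0 / (n1 + n2))   # equal allocation per arm
    return diff - z * se, diff + z * se


if __name__ == "__main__":
    # Table 1: 36 mg galantamine (1.20, 0.69) versus placebo (-1.35, -1.61).
    lower, upper = naive_ci(1.20, 0.69, -1.35, -1.61, n1=20, n2=80, sigma=6.17)
    print(f"{lower:.3f} {upper:.3f}")  # prints "0.640 4.060"
```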

Table 2. 95% confidence intervals for data in Table 1

Method                                               Lower bound    Upper bound
Naive                                                0.640          4.060
Sampson and Sill                                     0.249          3.814
Wu et al.                                            0.443          4.060
Posch et al. (with Šidak correction)                 0.359          4.060
Posch et al. (with Dunnett correction)               0.434          4.060
Stallard and Todd (inline image)                     0.480          4.036
Stallard and Todd (θ1 and θ2 given by naive MLEs)    0.276          3.935

For the Wu et al. confidence interval, when selection is at selection time 0.2 and three experimental treatments are tested in stage 1, inline image. Therefore, following Section 'Wu et al. (2010)', the lower bound is inline image. Wu et al. propose using the upper bound from the naive confidence interval so that the Wu et al. confidence interval is (0.443, 4.060). As expected, the Wu et al. lower bound is smaller than the naive lower bound because it accounts for treatment selection. However, the Wu et al. lower bound is larger than the Sampson and Sill lower bound. A reason for this could be that the Wu et al. bounds depend on the number of experimental treatments, the selection time and the treatment effect of the apparently most effective treatment, while the Sampson and Sill bounds additionally depend on the treatment effect of the apparently second most effective treatment. The Wu et al. lower bound would remain the same as long as the estimates for the 18 mg/day and 24 mg/day galantamine doses are below 1.20 (the estimate for 36 mg/day), while the Sampson and Sill bounds, as can be deduced from expression (3), would change because they depend on the difference between the observed mean of the apparently second most effective treatment and the observed mean of the selected treatment.
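To make the dependence on the number of treatments and the selection time concrete, the sketch below approximates a critical value of this kind by Monte Carlo rather than from Eq. (4). It rests on assumptions not spelled out in this section: that the critical value is the upper 2.5% point of the standardised naive estimate for the selected treatment when all treatment differences are equal, and that the lower bound is the naive estimate minus that critical value times the naive standard error. The function name and the simulation approach are hypothetical stand-ins for the authors' numerical computation.

```python
import math
import random


def wu_critical_value(k, t, alpha=0.025, draws=200_000, seed=1):
    """Monte Carlo upper-alpha point of the standardised selected estimate.

    k experimental arms share one control at stage 1, the arm with the
    largest stage 1 mean is selected at information time t, and all true
    treatment differences are taken to be equal.
    """
    rng = random.Random(seed)
    w1, w2 = math.sqrt(t), math.sqrt(1.0 - t)
    stats = []
    for _ in range(draws):
        control = rng.gauss(0.0, 1.0)
        best = max(rng.gauss(0.0, 1.0) for _ in range(k))
        z1 = (best - control) / math.sqrt(2.0)   # stage 1 z for selected arm
        z2 = rng.gauss(0.0, 1.0)                 # stage 2 z, no selection
        stats.append(w1 * z1 + w2 * z2)
    stats.sort()
    return stats[math.ceil((1.0 - alpha) * draws) - 1]


if __name__ == "__main__":
    d = wu_critical_value(k=3, t=0.2)
    estimate = 2.35                        # naive estimate from Table 1
    se = 6.17 * math.sqrt(2.0 / 100.0)     # naive standard error
    print(d, estimate - d * se)
```

The resulting lower bound is subject to Monte Carlo error and can be compared with the Wu et al. row of Table 2.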

Figure 4A gives part of the lower bounding surface for the confidence region obtained using the Posch et al. method when the Šidák correction is used. The parameters θ1, θ2 and θ3, respectively, denote the treatment effects of doses 18 mg, 24 mg and 36 mg of galantamine. The upper bounding surface (not presented) is a flat plane because, as described in Section 'Posch et al. (2005)', we use the naive confidence interval upper bound for the upper bound. The upper bounding surface of the confidence region is at inline image (the naive confidence interval upper bound) for all values of θ1 and θ2. For both θ1 and θ2, the lower and upper bounds are inline image and ∞. This is because no stage 2 data are available to test hypotheses corresponding to these parameters, so their stage 2 adjusted p-values are set equal to 1 and, with the inverse normal combination method, no hypotheses concerning these parameters can be rejected. The confidence interval can be obtained by embedding a hyper-rectangle over the confidence region. However, this is not necessary: the confidence interval can be obtained by solving for a single root using expression (8) to give the lowest value for θ3. The Posch et al. (with Šidák correction) confidence interval is (0.359, 4.060). As expected from the explanation of results in Section 'Simulation results', the Posch et al. (with Šidák correction) lower bound is considerably smaller than the Wu et al. lower bound. Figure 4B gives part of the lower bounding surface for the confidence region obtained using the Posch et al. method when the Dunnett correction is used. The surface is higher than in Fig. 4A because the Dunnett correction, unlike the Šidák correction, accounts for the use of a common control arm, which makes it less conservative. The Posch et al. (with Dunnett correction) confidence interval is (0.434, 4.060). As expected from the explanation of results in Section 'Simulation results', the Posch et al. (with Dunnett correction) lower bound is only slightly different from the Wu et al. lower bound.
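The two ingredients discussed above, the multiplicity correction and the inverse normal combination of stage-wise p-values, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, one-sided p-values and known variances are assumed, the stage 1 weight is assumed to be the square root of the selection time, and the Dunnett critical value is approximated by Monte Carlo under an assumed equal allocation (correlation 0.5 between comparisons with the common control).

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def sidak_adjusted_p(p_values):
    """Sidak-adjusted p-value for the intersection of len(p_values) hypotheses."""
    k = len(p_values)
    return 1.0 - (1.0 - min(p_values)) ** k


def inverse_normal_combination(p1, p2, w1):
    """Combine two stage-wise one-sided p-values.

    w1 is the (assumed) stage 1 weight; the stage 2 weight is sqrt(1 - w1**2).
    """
    if p1 >= 1.0 or p2 >= 1.0:
        return 1.0  # a stage-wise p-value of 1 can never lead to rejection
    if p1 <= 0.0 or p2 <= 0.0:
        return 0.0
    w2 = math.sqrt(1.0 - w1 ** 2)
    # Phi^{-1}(1 - p) = -Phi^{-1}(p), so the combined p-value is
    # Phi(w1 * Phi^{-1}(p1) + w2 * Phi^{-1}(p2)).
    return STD_NORMAL.cdf(w1 * STD_NORMAL.inv_cdf(p1) + w2 * STD_NORMAL.inv_cdf(p2))


def sidak_critical_value(k, alpha):
    """One-sided critical value treating the k comparisons as independent."""
    return STD_NORMAL.inv_cdf((1.0 - alpha) ** (1.0 / k))


def dunnett_critical_value_mc(k, alpha, n_sim=200_000, seed=1):
    """Monte Carlo approximation of the one-sided Dunnett critical value.

    Assumes equal allocation and known variance, so the k test statistics
    share a common control and have pairwise correlation 0.5.
    """
    rng = random.Random(seed)
    maxima = []
    for _ in range(n_sim):
        control = rng.gauss(0.0, 1.0)
        maxima.append(
            max((rng.gauss(0.0, 1.0) - control) / math.sqrt(2.0) for _ in range(k))
        )
    maxima.sort()
    # Empirical (1 - alpha) quantile of the maximum test statistic.
    return maxima[math.ceil((1.0 - alpha) * n_sim) - 1]
```

For example, `inverse_normal_combination(0.01, 1.0, math.sqrt(0.2))` returns 1, which mirrors why no hypothesis about θ1 or θ2 can be rejected once their stage 2 adjusted p-values are set to 1. Comparing `dunnett_critical_value_mc(3, 0.025)` with `sidak_critical_value(3, 0.025)` shows the smaller Dunnett critical value, in line with the Dunnett correction being less conservative.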

Figure 4. Panels (A)–(C), respectively, give the Posch et al. (with Šidák correction), Posch et al. (with Dunnett correction) and Stallard and Todd lower bounding surfaces of the confidence regions. Panel (D) gives the upper bounding surface of the Stallard and Todd confidence region.

Figure 4C and D, respectively, show parts of the lower and upper bounding surfaces for the confidence region obtained using the Stallard and Todd method. The Stallard and Todd lower bounding surface is lower than the Posch et al. lower bounding surfaces. From the Stallard and Todd lower bounding surface, the lower bound for θ3 decreases as the values of θ1 and θ2 increase. This is similar to the Posch et al. confidence regions. The upper bound for θ3 also decreases as the values of θ1 and θ2 increase. As described in Section 'Stallard and Todd (2005)', Stallard and Todd suggest obtaining the confidence interval for θ3 by assuming inline image or by assuming the values of θ1 and θ2 are given by the naive maximum-likelihood estimates, which for this example are, respectively, 1.53 and 2.53. Using Figure 4C and D, assuming inline image, the Stallard and Todd confidence interval for θ3 is (0.480, 4.036), which is closest to the naive confidence interval. The reason for this may be that these values of θ1 and θ2 lead to a high probability of selecting dose 36 mg so that the selection bias is almost zero and so only a small adjustment of the naive confidence interval is required. If we assume θ1 and θ2 are given by the naive maximum-likelihood estimates, the Stallard and Todd confidence interval for θ3 is (0.276, 3.935), which is close to the confidence interval obtained using the Sampson and Sill method.

5 Discussion

Innovation in clinical trial design has led to seamless phase II/III clinical trials. These trials are efficient because data used to answer phase II objectives such as treatment selection are used in the final phase III confirmatory analysis. However, making inferences in seamless phase II/III clinical trials poses statistical challenges associated with comparing several experimental treatments to a control and with combining evidence from the phase II stage and evidence from the phase III stage. In this paper, we have considered construction of confidence intervals after a seamless phase II/III clinical trial. We have compared various methods of constructing confidence intervals that we have identified in the literature. We considered the case where the apparently most effective experimental treatment and the control continue to stage 2 and there is no opportunity to stop at stage 1. The comparison has focussed on the confidence interval for the treatment difference between the selected treatment and the control.

We have compared the methods by assessing their coverage probabilities and the probability that a confidence interval lies entirely above the true treatment difference of the selected treatment. Based on these properties, we observed that the naive confidence intervals are inappropriate and that they become more inappropriate as the number of experimental treatments in stage 1 increases and as the selection is done later in the trial. The Sampson and Sill (2005) confidence intervals perform best. However, all the other confidence intervals that account for the treatment selection perform well in most scenarios. Hence, if a confidence interval is to be used to complement hypothesis testing results, we recommend using the confidence intervals that correspond to the method used to test the hypotheses, so that the confidence intervals are approximately consistent with hypothesis testing. Because investigators may sometimes want to use the confidence interval both for hypothesis testing and for interval estimation, we also assessed the probability that the various confidence intervals exclude the null value of 0. The Wu et al. and Posch et al. (with Dunnett correction) confidence intervals have the best power for testing superiority of the selected treatment. Hence another option for obtaining a confidence interval is to use a hybrid strategy in which the lower bound is obtained using the Wu et al. method or the Posch et al. method, because this leads to higher power, and the upper bound is obtained using the Sampson and Sill method.
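The behaviour of the naive confidence interval described above can be checked with a small simulation. The following is a minimal sketch rather than the simulation study of Section 3: it assumes normally distributed outcomes with known unit variance, equal true means in all arms (so the true treatment difference is 0), selection of the arm with the highest stage 1 mean, and a naive two-sided 95% interval based on the pooled stage 1 and stage 2 data. The function name and its default settings are hypothetical.

```python
import math
import random


def naive_lower_bound_exceedance(k, n1, n2, n_sim=20000, seed=2024):
    """Estimate P(naive 95% CI lower bound > true difference).

    k experimental arms and one control are observed in stage 1 (n1 per arm);
    the arm with the highest stage 1 mean and the control continue to stage 2
    (n2 per arm). All true means are 0 and the variance is known to be 1.
    """
    rng = random.Random(seed)
    z = 1.959964  # two-sided 95% normal quantile
    n = n1 + n2
    se = math.sqrt(2.0 / n)  # standard error of the pooled difference in means
    sd1 = 1.0 / math.sqrt(n1)  # standard deviation of a stage 1 sample mean
    sd2 = 1.0 / math.sqrt(n2)  # standard deviation of a stage 2 sample mean
    exceed = 0
    for _ in range(n_sim):
        control1 = rng.gauss(0.0, sd1)
        # With a common control, the highest mean is also the highest difference.
        best1 = max(rng.gauss(0.0, sd1) for _ in range(k))
        control2 = rng.gauss(0.0, sd2)
        selected2 = rng.gauss(0.0, sd2)
        # Naive estimate: pool both stages as if no selection had taken place.
        estimate = (n1 * (best1 - control1) + n2 * (selected2 - control2)) / n
        if estimate - z * se > 0.0:
            exceed += 1
    return exceed / n_sim
```

With a single experimental arm (`k=1`) there is no selection and the estimated probability should be close to the nominal 0.025; with `k=3` and `n1 = n2` it is noticeably larger. Increasing `k`, or the share of the total sample size taken by `n1`, can be used to explore the trends described above.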

The findings in this paper are based on the case where the trial always continues to stage 2 and the sample sizes n1 and n2 are fixed. The Sampson and Sill (2005) method of constructing confidence intervals was developed specifically for this case, so for this method a confidence interval can only be obtained if the trial continues to stage 2, and the interval is valid for fixed n2. The Stallard and Todd (2005) method of constructing confidence intervals allows stopping early for futility or efficacy based on stage 1 results but does not allow re-estimating n2. The methods of Posch et al. (2005) and Wu et al. (2010) allow both stopping early for futility or efficacy and re-estimating n2, and we expect their confidence intervals to approximately coincide if, for the Posch et al. method, the Dunnett correction is used.

We did not review the methods suggested by Neal et al. (2011) and Magirr et al. (2013) because for the case considered in this paper, where the apparently most effective treatment is selected to continue to stage 2, they would, respectively, give the same results as the methods of Wu et al. (2010) and Posch et al. (2005). Neal et al. (2011) propose a lower bound for the case when the selected treatment does not have to be the apparently most effective. They use the same formula for the lower bound as Wu et al. (2010) including using a critical value obtained assuming the most effective treatment is selected. Unlike Posch et al. (2005) who only consider the global intersection hypothesis, Magirr et al. (2013) consider all the hypotheses in the closure set to construct confidence intervals that are consistent with the hypothesis testing. When the apparently most effective treatment is selected, it is sufficient to use the global intersection hypothesis to test the hypothesis corresponding to the selected treatment (Maurer et al., 2009) so that the Posch et al. and the Magirr et al. confidence intervals coincide for this case.

In this paper, we did not consider the setting where the selected experimental treatment is not the apparently most effective. The Sampson and Sill (2005) and the Stallard and Todd (2005) methods cannot be applied in this setting because they assume the apparently most effective experimental treatment is selected to continue to stage 2. The methods that can be applied in this setting are those proposed by Posch et al. (2005), Neal et al. (2011) and Magirr et al. (2013). Neal et al. (2011) have proved theoretically that their method has the desired coverage. However, they only derive the lower bound. We anticipate that the Posch et al. confidence intervals may have undesired characteristics because this method does not consider all the hypotheses in the closure set. The Magirr et al. confidence intervals will have the desired characteristics for any selection rule because these authors consider all hypotheses in the closure set when obtaining the confidence intervals. Although Magirr et al. (2013) derive only the lower bounds, the upper bounds can also be derived. The setting where the selected treatment may not be the apparently most effective is the topic of our further work on methods for constructing simultaneous confidence intervals. We intend to develop new methods where the selected treatment does not have to be the apparently most effective and to compare them with the existing methods.

Acknowledgements

The authors are very grateful to the editor, the associate editor and two reviewers for comments that greatly improved this work. The work was supported in part by the Targeted Research Programme of the Medicines and Healthcare products Regulatory Agency (MHRA).

References
