Simultaneous confidence intervals for ratios with application to the gold standard design with more than one experimental treatment

Considering a study design with two experimental treatments, a reference treatment and a placebo, we extend a previous approach considering the ratios of effects to a procedure for analyzing multiple ratios. The technical framework for constructing tests and compatible simultaneous confidence intervals is set in a general manner. Besides a single step procedure and its extension to a stepdown procedure, also, an informative stepwise procedure in the spirit of our previous work is developed. The latter is especially interesting, because noninferiority studies require informative confidence intervals to infer more information than just noninferiority at the prespecified margin. Results from a simulation study for the three methods are shown. We also argue that an extension to more than two experimental treatments is straightforward.


INTRODUCTION AND NOTATIONS
Consider the gold standard design for clinical trials, which includes a new treatment (called experimental treatment), a standard treatment for the given indication (called reference), and a placebo. This design is particularly recommended if the reference effect has proven unstable in earlier trials and if the inclusion of a placebo group is ethically acceptable. To establish the efficacy of the experimental treatment, noninferiority to the reference has to be shown. Furthermore, the assay sensitivity of the clinical trial has to be established: noninferiority to the reference alone is not sufficient, because it can already be shown if neither the reference nor the experimental treatment has any effect over placebo. Therefore, the placebo group is included in the trial.
Three-armed clinical trials have been proposed in the works of Pigeot et al 1 and Koch and Röhmel. 2 The first uses the ratio of the effects over placebo of the experimental and the reference treatment as parameter of interest, thus requiring the superiority of reference over placebo. The latter proposes a hierarchical strategy, testing first the superiority of the experimental treatment over placebo and then the noninferiority of the experimental treatment to the reference. Both the superiority of the reference and the superiority of the experimental treatment over placebo assure assay sensitivity.
In the following, we pursue the approach of Pigeot et al, 1 ie, we consider the null hypothesis $H_\theta: (\mu_E - \mu_P)/(\mu_R - \mu_P) \le \theta$, which states that the effect of the experimental treatment is at most the fraction $\theta$ of the reference's effect (both compared to placebo). Such a proof of efficacy is useful, since the clinically relevant effect of the experimental treatment over placebo is frequently described as a proportion of the reference's effect (see the work of Hauschke and Pigeot 3 ). Noninferiority of the experimental treatment to the reference is shown if $H_g$ is rejected for a suitable noninferiority margin $g$, which is defined in the planning of the clinical trial. Normally, $g$ is some value smaller than 1 but larger than 1/2. Note that the superiority $\mu_R > \mu_P$ is needed as a prerequisite for this approach. Due to the hierarchical order, $H_\theta$ can be tested at full level after the reference's superiority has been established. Especially in noninferiority studies, knowledge about the size of the actual effect is more important than the mere rejection of the null hypothesis. Establishing a confidence interval is a useful way to test all $H_\theta$ simultaneously and thus gain information without the need to adjust the significance level. For example, rejection of $H_1$ would be a proof of the superiority of the experimental treatment over the reference. The set of all $\theta \in \mathbb{R}$ for which the corresponding null hypothesis $H_\theta$ is not rejected constitutes a natural left-sided confidence interval for the ratio $(\mu_E - \mu_P)/(\mu_R - \mu_P)$. In the work of Pigeot et al, 1 the construction of this confidence interval is performed using Fieller's method. 4
In this paper, we want to extend the idea of Pigeot et al 1 to situations where two experimental treatments are to be tested simultaneously in a gold standard trial. For example, one might want to examine two different doses of a new drug. Then, an adjustment for the multiple type I error is needed. Since we are particularly interested in confidence intervals, our goal is to construct simultaneous confidence intervals (SCIs) for
$$\theta_i = \frac{\mu_{E_i} - \mu_P}{\mu_R - \mu_P}, \quad i = 1, 2, \qquad (1)$$
where $\mu_{E_i}$ is the effect of treatment $i$ for $i = 1, 2$.
We therefore construct multiple test procedures for the hypotheses $H_{i,\theta_i}: \mu_{E_i} - \mu_P \le \theta_i(\mu_R - \mu_P)$, $i = 1, 2$. For the sake of clarity, we restrict ourselves to the case of two experimental treatments, although a generalization to more than two treatments is straightforward (see Remark 2 below). The construction of SCIs for multiple ratios has been considered in the work of Dilba et al 5 for the special case of the comparison of several treatments with a control, and in the work of Dilba et al 6 for the general case. We assume normally distributed observations and denote by $\bar X_s$ the observed mean effect in group $s$ for $s \in \{E_1, E_2, P, R\}$. Variance homogeneity across the four groups is assumed. A natural test statistic for $H_{i,\theta_i}$ is
$$T_{i,\theta_i} = \frac{\bar X_{E_i} - \theta_i \bar X_R - (1 - \theta_i)\bar X_P}{S\sqrt{v_i(\theta_i)}}, \qquad (2)$$
where $S^2$ is the pooled variance estimator from the four groups and $v_i(\theta_i)$ is the variance of the numerator of $T_{i,\theta_i}$ divided by the variance of each single observation. Explicitly,
$$v_i(\theta_i) = \frac{1}{n_{E_i}} + \frac{\theta_i^2}{n_R} + \frac{(1 - \theta_i)^2}{n_P}, \qquad (3)$$
where $n_s$ is the number of observations in group $s$. Under $H_{i,\theta_i}$, the test statistic $T_{i,\theta_i}$ has a t-distribution with $\nu = n_{E_1} + n_{E_2} + n_R + n_P - 4$ degrees of freedom. The SCI estimation for $\theta_1$ and $\theta_2$ is complicated by the fact that the correlation of $T_{1,\theta_1}$ and $T_{2,\theta_2}$ depends on the unknown parameters. There are several proposals to replace this correlation by an observed quantity; see the work of Dilba et al 6 and Section 4 for details.
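As a small numerical illustration of the statistic described above, the following Python sketch evaluates $T_{i,\theta_i}$ from summary statistics. It assumes the usual form of the statistic in the three-arm ratio setting, with numerator $\bar X_{E_i} - \theta_i \bar X_R - (1-\theta_i)\bar X_P$ and variance factor $v_i(\theta_i) = 1/n_{E_i} + \theta_i^2/n_R + (1-\theta_i)^2/n_P$. The function names and the example inputs are hypothetical and not taken from the authors' programs.

```python
import math


def variance_factor(theta, n_e, n_r, n_p):
    """Variance of the numerator divided by the single-observation variance."""
    return 1.0 / n_e + theta ** 2 / n_r + (1.0 - theta) ** 2 / n_p


def t_statistic(theta, mean_e, mean_r, mean_p, s, n_e, n_r, n_p):
    """Assumed form of T_{i,theta}; s is the pooled standard deviation."""
    numerator = mean_e - theta * mean_r - (1.0 - theta) * mean_p
    return numerator / (s * math.sqrt(variance_factor(theta, n_e, n_r, n_p)))


# Hypothetical summary data: E as effective as R, evaluated at theta = 0.8.
t_at_margin = t_statistic(0.8, mean_e=1.0, mean_r=1.0, mean_p=0.0,
                          s=1.0, n_e=476, n_r=381, n_p=96)
print(round(t_at_margin, 3))
```

The statistic is compared with a quantile of the t-distribution with $\nu$ degrees of freedom; large values speak against the hypothesis for the chosen $\theta$.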
It is well known that stepwise procedures are more powerful than single-step procedures. 7 Hence, if the major interest of the clinical trial consists in establishing noninferiority of preferably both new treatments, then a stepwise procedure might be preferred. However, in noninferiority studies, confidence intervals are normally indispensable, and it is not trivial to obtain SCIs with good properties that are compatible with the rejection decisions of a stepwise multiple test procedure. To our knowledge, all existing SCIs for ratios are derived in a single-step procedure, ie, the confidence bound for $\theta_i$ does not depend on $\bar X_{E_j}$ with $j \ne i$. Only the work of Bretz et al 8 considered stepwise SCIs for ratios; however, they restrict themselves to dose-finding studies, where a natural hierarchical ordering is present. Besides, the hierarchical SCIs provide information only if all hypotheses are rejected. The latter point might be solved by applying the SCIs of Schmidt and Brannath. 9 However, in our application to the gold standard design with two experimental treatments, there is no clear hierarchy of testing; therefore, we need another approach. Strassburger and Bretz 10 proposed a construction of SCIs for a large class of closed testing procedures. The drawback of these SCIs is that the confidence interval is often equal to the alternative of the null hypothesis if it is rejected, thus giving no information on the size of the effect. To overcome the problem of such noninformative rejections, Brannath and Schmidt 11 derived SCIs that constitute a flexible compromise between the powerful stepwise SCIs and the informative single-step SCIs. In this paper, we give a general approach to obtain SCIs, which is based on the partitioning lemma. 12 It includes the standard single-step SCI, a stepwise extension, and a weighted approach inspired by Brannath and Schmidt. 11 We will compare these three alternative constructions by simulations in the context of a gold standard trial with two experimental treatments.
Our article is arranged as follows. In Section 2, we explain in a very general manner how to construct SCIs for two parameters. We present stepdown SCIs similar to the work of Strassburger and Bretz 10 and weighted SCIs in the spirit of Brannath and Schmidt. 11 In Section 3, we construct SCIs for the gold standard design with two experimental treatments giving concrete formulas for the confidence bounds. We extend the approach of Dilba et al 6 who constructed single-step SCIs for ratios of linear combinations of treatment means, to stepwise tests and compatible SCIs. In Section 4, we analyze the three methods by simulations in clinical trials with different settings. In Section 5, we discuss our results and mention possible extensions of the methods presented here. In particular, we explain how our approach is generalized to more than two treatments.

SINGLE-STEP AND STEPWISE SCIS
We consider the general situation of testing two hypotheses $H_i: \theta_i \le g$, $i = 1, 2$, against their alternatives $A_i: \theta_i > g$ at significance level $\alpha \in (0, 1)$, via an SCI $(L_1, \infty) \times (L_2, \infty)$ with coverage probability $1 - \alpha$. That is, we reject $H_i$ if and only if $L_i \ge g$. Note that we choose for simplicity the same margin $g$ for both hypotheses. If there is a practical reason to use different margins $g_1$ and $g_2$, this can be realized by obvious replacements of $g$ by $g_1$ or $g_2$ in the formulas below.
By the duality of tests and confidence sets, 13 we can construct the SCI from individual tests $\varphi(\theta) \in \{0, 1\}$ with significance level $\alpha$ for every $\theta = (\theta_1, \theta_2) \in \mathbb{R}^2$. More precisely, $\varphi(\theta) = 1$ means that the parameter value $\theta$ is rejected, and the set $C = \{\theta \in \mathbb{R}^2 : \varphi(\theta) = 0\}$ of all nonrejected values is a confidence set with coverage probability at least $1 - \alpha$. An SCI with coverage probability not smaller than $1 - \alpha$ is given by the smallest rectangle $(L_1, \infty) \times (L_2, \infty)$ containing $C$. Frequently, the tests $\varphi(\theta)$ are constructed in a way that $C$ already forms a rectangle and thus is equal to the SCI.
To define the tests $\varphi(\theta)$ in the following derivation of the single-step, stepwise, and weighted SCI, we need the one-dimensional tests for $H_{i,\theta_i}$ at any significance level $\alpha \in (0, 1)$, which we denote by $\varphi_i^\alpha(\theta_i)$. It is natural to assume that every $\varphi_i^\alpha$ is monotonically decreasing in $\theta_i$ and that $\varphi_i^{\alpha/2} \le \varphi_i^\alpha$, ie, a parameter value that is rejected at level $\alpha/2$ is rejected at level $\alpha$ as well.

Single-step SCI
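We choose individual tests of the form $\varphi(\theta) = \max\{\varphi_1^{\alpha/2}(\theta_1), \varphi_2^{\alpha/2}(\theta_2)\}$. Then, $\varphi(\theta)$ is an $\alpha$-test by the Bonferroni inequality. We obtain $L_i = \inf\{\theta_i : \varphi_i^{\alpha/2}(\theta_i) = 0\}$ for $i = 1, 2$. The confidence bound for $\theta_i$ does not depend on the confidence bound for $\theta_j$ with $j \ne i$; hence, we obtain a single-step SCI. We also note that this SCI does not depend on the value of the noninferiority margin $g$, which is an important difference from the stepwise SCI and the weighted SCI considered next.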
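For completeness, the Bonferroni argument can be written out in one line. The sketch assumes only that each one-dimensional test $\varphi_i^{\alpha/2}$ has level $\alpha/2$ when $\theta_i$ is the true parameter value, and that the individual test rejects as soon as one of the two one-dimensional tests rejects:

```latex
P_\theta\bigl(\varphi(\theta) = 1\bigr)
  = P_\theta\bigl(\varphi_1^{\alpha/2}(\theta_1) = 1 \ \text{or}\ \varphi_2^{\alpha/2}(\theta_2) = 1\bigr)
  \le \sum_{i=1}^{2} P_\theta\bigl(\varphi_i^{\alpha/2}(\theta_i) = 1\bigr)
  \le \frac{\alpha}{2} + \frac{\alpha}{2} = \alpha .
```

No assumption on the dependence between the two test statistics is needed for this bound, which is why it remains valid although their correlation depends on the unknown parameters.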

Stepwise SCI
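In order to improve the probabilities of rejecting $H_1$ and $H_2$, we partition the parameter space $\mathbb{R}^2$ into four disjoint regions and consider different individual tests in each region. Let $\mathbb{R}_I$ be the region of parameters for which $I \subseteq \{1, 2\}$ is the index set of true null hypotheses. More precisely, $\mathbb{R}_I = \{\theta \in \mathbb{R}^2 : \theta_i \le g \text{ for } i \in I,\ \theta_i > g \text{ for } i \notin I\}$. The stepwise SCI is determined by different individual $\alpha$-tests $\varphi_I$ in each region $\mathbb{R}_I$, which are given by
$$\varphi_{\{1,2\}}(\theta) = \varphi_{\emptyset}(\theta) = \max\{\varphi_1^{\alpha/2}(\theta_1), \varphi_2^{\alpha/2}(\theta_2)\}, \quad \varphi_{\{1\}}(\theta) = \varphi_1^{\alpha}(\theta_1), \quad \varphi_{\{2\}}(\theta) = \varphi_2^{\alpha}(\theta_2).$$
By our choice of $\varphi_{\{1\}}$ and $\varphi_{\{2\}}$, we put more emphasis on the relevant null hypotheses and increase the power to reject them. In the regions $\mathbb{R}_{\{1,2\}}$ and $\mathbb{R}_{\emptyset}$, where both $\theta_1$ and $\theta_2$ belong either to the null or to the alternative hypothesis, equal weight is given to both parameters. Note that $H_i$ is rejected if and only if $\varphi_{\{i\}}(g) = 1$ and $\varphi_{\{1,2\}}(g) = 1$, where the tests are evaluated at $\theta = (g, g)$. There are four cases, which are illustrated by Figure 1:
Case 1: $\max\{\varphi_1^{\alpha/2}(g), \varphi_2^{\alpha/2}(g)\} = 0$ (both null hypotheses accepted),
Case 2: $\varphi_2^{\alpha/2}(g) = 1$ and $\varphi_1^{\alpha}(g) = 0$ (only $H_2$ rejected),
Case 3: $\varphi_1^{\alpha/2}(g) = 1$ and $\varphi_2^{\alpha}(g) = 0$ (only $H_1$ rejected),
Case 4: $\max\{\varphi_1^{\alpha/2}(g), \varphi_2^{\alpha/2}(g)\} = 1$ and $\varphi_1^{\alpha}(g) = \varphi_2^{\alpha}(g) = 1$ (both null hypotheses rejected). (4)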
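For $i = 1, 2$, denote by $L_i^I$ the confidence bound in region $\mathbb{R}_I$. Formally, $L_i^I = \inf\{\theta_i : \varphi_I(\theta') = 0 \text{ for some } \theta' \in \mathbb{R}_I \text{ with } \theta'_i = \theta_i\}$, ie, the bound $L_i^I$ excludes those values $\theta_i$ for which all $\theta \in \mathbb{R}_I$ with this component value are rejected by $\varphi_I$. The confidence set is now the union of the confidence sets within the four regions, ie, the final bound $L_i$ is the lowest of the four $L_i^I$ (see Figure 1). Due to the monotonicity properties of the one-dimensional tests $\varphi_i^\alpha$, we can explicitly determine the SCI $(L_1, \infty) \times (L_2, \infty)$. Writing $\ell_i(\alpha') = \inf\{\theta_i : \varphi_i^{\alpha'}(\theta_i) = 0\}$ for the one-dimensional bound at level $\alpha'$, the bound $L_1$ is explicitly given as
$$L_1 = \begin{cases} \ell_1(\alpha/2) & \text{in Case 1}, \\ \ell_1(\alpha) & \text{in Case 2}, \\ g & \text{in Case 3}, \\ \max\{g, \ell_1(\alpha/2)\} & \text{in Case 4}. \end{cases} \qquad (5)$$
These formulas for the stepwise SCI are basically an application of the method in the work of Strassburger and Bretz 10 for the Bonferroni-Holm procedure. The confidence bound $L_2$ is obtained by interchanging the indices 1 and 2 in (5).
FIGURE 2 The weight function $w_1(\theta)$ in (7) as a function of $\theta_1$. The variable $\theta_2$ is fixed at some value smaller than $g$ (left panel) or larger than $g$ (right panel)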
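The case distinction for the stepwise bounds can be sketched in a few lines of Python. The sketch assumes that the one-dimensional lower bounds at levels $\alpha/2$ and $\alpha$ have already been computed, and that the stepwise bounds follow the Bonferroni-Holm pattern attributed above to Strassburger and Bretz. The function name and the example inputs are hypothetical.

```python
def stepdown_bounds(bound_half, bound_full, g):
    """Stepwise lower bounds (L1, L2) for two hypotheses H_i: theta_i <= g.

    bound_half[i]: one-dimensional lower bound for theta_i at level alpha/2
    bound_full[i]: one-dimensional lower bound for theta_i at level alpha
    H_i is rejected at a level exactly when the bound at that level is >= g.
    """
    # Step 1 (Bonferroni): is any hypothesis rejected at level alpha/2?
    first = [b >= g for b in bound_half]
    if not any(first):
        # Case 1: nothing rejected, keep the single-step bounds.
        return tuple(bound_half)
    # Step 2 (Holm): a hypothesis is rejected if it is rejected at alpha/2,
    # or at level alpha once the other one has been rejected in step 1.
    rejected = [first[i] or (first[1 - i] and bound_full[i] >= g)
                for i in range(2)]
    if all(rejected):
        # Case 4: both rejected, the bounds are informative only above g.
        return tuple(max(g, b) for b in bound_half)
    # Cases 2 and 3: the rejected hypothesis gets the bound g, the other
    # one gets its one-dimensional bound at the full level alpha.
    return tuple(g if rejected[i] else bound_full[i] for i in range(2))


# Hypothetical bounds: H_1 is rejected in step 1, H_2 is not rejected at all.
print(stepdown_bounds((0.85, 0.70), (0.90, 0.78), g=0.8))
```

In this hypothetical example, the rejected hypothesis receives exactly the bound $g$, which is the noninformative rejection that motivates the weighted construction.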

Weighted SCI
As can be seen from (5), the confidence bound $L_1$ may equal $g$ if $H_1$ is rejected. This means essentially rejecting $H_1$ without obtaining more information from the confidence interval, because $(L_1, \infty)$ is the complete alternative hypothesis. To avoid these noninformative rejections and still obtain more power than the single-step SCI, we propose a third approach, which is based on the ideas of Brannath and Schmidt. 11 We have so far not exploited the fact that the individual tests $\varphi(\theta)$ may also depend on $\theta$ in a continuous way. Therefore, we introduce continuous weight functions $w_i(\theta)$ for $i = 1, 2$, which satisfy $w_1(\theta) + w_2(\theta) = 1$ for each $\theta \in \mathbb{R}^2$. Then, we put
$$\varphi(\theta) = \max\{\varphi_1^{w_1(\theta)\alpha}(\theta_1), \varphi_2^{w_2(\theta)\alpha}(\theta_2)\}. \qquad (6)$$
Again, this is an $\alpha$-test by the Bonferroni inequality. With the same idea as for the stepdown test, we define our weight functions separately on each of the four regions of $\mathbb{R}^2$. Let $E(x)$ be any continuous decreasing function on $[g, \infty)$ with $E(g) = 1$ and $E(x) \to 0$ as $x \to \infty$. The weight function $w_1$ is given by (7). Then, $w_2(\theta) = 1 - w_1(\theta)$ has a shape analogous to that of $w_1(\theta)$. Note that $w_i$ is decreasing in $\theta_i$ and increasing in $\theta_j$ for $j \ne i$. Figure 2 illustrates the shape of the weight function and compares it to the respective weights for the single-step and the stepdown interval. The weight function can be seen as an interpolation between the two extremes, where parameter values that are smaller than $g$ obtain equal weight as values larger than $g$ (single-step) or maximal weight (stepdown) in the dual test.
We will show in Section 3 that, under mild assumptions, the SCI $(L_1, \infty) \times (L_2, \infty)$ that results from the individual tests (6) can be determined by a simple bisection algorithm.

APPLICATION TO THE GOLD STANDARD DESIGN WITH TWO EXPERIMENTAL TREATMENTS
We now want to obtain SCIs with coverage probability $1 - \alpha$ for the parameters $\theta_1$ and $\theta_2$ in (1), ie, for the ratio of the effects of two experimental treatments (compared to placebo) relative to the effect of the reference (compared to placebo). In general, confidence intervals for ratios can be obtained by Fieller's method. 4 As we have mentioned before, the test statistic in (2) has, under $H_{i,\theta_i}$, a t-distribution with $\nu = n_{E_1} + n_{E_2} + n_R + n_P - 4$ degrees of freedom.

One-dimensional confidence interval
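We first describe the construction of a one-dimensional confidence interval with coverage probability $1 - \alpha$. It can be found in the work of Dilba et al 6 for general ratios and in the work of Pigeot et al 1 for the given situation. By the duality of tests and confidence intervals, a one-sided one-dimensional confidence set with coverage probability $1 - \alpha$ for $\theta_i$ is given as $C_i = \{\theta_i \in \mathbb{R} : T_{i,\theta_i} < q^t_{\nu,1-\alpha}\}$, where $q^t_{\nu,1-\alpha}$ denotes the $(1 - \alpha)$-quantile of the t-distribution with $\nu$ degrees of freedom. Because $T_{i,\theta_i}$ is not monotone in $\theta_i$, we need the following assumption, which states that the reference is significantly superior (at significance level $\alpha/2$) to the placebo.
Assumption 1 (Reference superiority). $(\bar X_R - \bar X_P) / \bigl(S\sqrt{1/n_R + 1/n_P}\bigr) > q^t_{\nu,1-\alpha/2}$.
It was stated in the work of Dilba et al 6 that, under this assumption, the confidence set is an interval of the form $(\theta_i(\alpha), \infty)$, where $\theta_i(\alpha)$ is the solution $\theta_i$ of the equation $T_{i,\theta_i} = q^t_{\nu,1-\alpha}$. In the following, we use the one-dimensional tests $\varphi_i^\alpha(\theta_i)$ to describe the single-step, stepdown, and weighted SCI for $\theta_1$ and $\theta_2$. In terms of Section 2, these tests have the form $\varphi_i^\alpha(\theta_i) = 1$ if $\theta_i \le \theta_i(\alpha)$ and $\varphi_i^\alpha(\theta_i) = 0$ otherwise. We note that our assumptions of Section 2 are satisfied, namely, $\varphi_i^\alpha(\theta_i)$ is decreasing in $\theta_i$ and $\varphi_i^{\alpha/2} \le \varphi_i^\alpha$.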
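The one-dimensional Fieller-type bound can be computed in closed form by squaring the equation $T_{i,\theta_i} = q$ and solving the resulting quadratic in $\theta_i$. The Python sketch below does this under stated assumptions: the statistic has numerator $\bar X_{E_i} - \theta_i \bar X_R - (1-\theta_i)\bar X_P$ and variance factor $1/n_{E_i} + \theta_i^2/n_R + (1-\theta_i)^2/n_P$, and the critical value `q` (a quantile of the t-distribution with $\nu$ degrees of freedom) is supplied by the caller. The function name and the example numbers are hypothetical.

```python
import math


def fieller_lower_bound(mean_e, mean_r, mean_p, s, n_e, n_r, n_p, q):
    """Lower confidence bound for (mu_E - mu_P) / (mu_R - mu_P).

    Solves T_theta = q for theta, where s is the pooled standard deviation
    and q > 0 is the critical value chosen by the caller.
    """
    a = mean_e - mean_p          # observed effect of E over placebo
    b = mean_r - mean_p          # observed effect of R over placebo
    c = (q * s) ** 2
    # Squaring T_theta = q gives lead*theta^2 - 2*mid*theta + const = 0.
    lead = b * b - c * (1.0 / n_r + 1.0 / n_p)
    if b <= 0 or lead <= 0:
        # Reference not significantly superior to placebo at this level,
        # so the confidence set is not an interval (L, infinity).
        raise ValueError("reference superiority condition violated")
    mid = a * b - c / n_p
    const = a * a - c * (1.0 / n_e + 1.0 / n_p)
    # The smaller root is the one with positive numerator, ie, T_theta = +q.
    return (mid - math.sqrt(mid * mid - lead * const)) / lead


# Hypothetical summary data; q approximates a one-sided 1.25% critical value.
lower = fieller_lower_bound(1.0, 1.0, 0.0, 1.0, 476, 381, 96, q=2.2414)
print(round(lower, 3))
```

Calling the function with the critical value for level $\alpha/2$ for each of the two experimental treatments yields the Bonferroni-based single-step bounds; the larger root of the quadratic corresponds to $T_{i,\theta_i} = -q$ and is not needed for the one-sided interval.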

Single-step SCI
The single-step SCI with coverage probability $1 - \alpha$ is the set of all $\theta \in \mathbb{R}^2$ that are not rejected by the multiple test $\varphi(\theta) = \max\{\varphi_1^{\alpha/2}(\theta_1), \varphi_2^{\alpha/2}(\theta_2)\}$. This simply leads to the single-step SCI $(L_1, \infty) \times (L_2, \infty)$ with $L_i = \theta_i(\alpha/2)$ for $i = 1, 2$.
Remark 1. Instead of using Bonferroni's inequality, it would of course be more efficient to work with the joint distribution of $T_{1,\theta_1}$ and $T_{2,\theta_2}$. As is pointed out in the work of Dilba et al, 6 the problem here is that the correlation of the two test statistics depends on the unknown parameter $\theta$. Less conservative methods than Bonferroni have been proposed that replace this correlation by an observed quantity, such that the coverage probability is preserved exactly or approximately. For better comparison with the other methods introduced here, we will only describe the Bonferroni approach in detail. However, in our simulations in Section 4, we will also compare our SCIs with the single-step SCIs obtained by two alternative approaches explained in the work of Dilba et al. 6
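
Stepdown SCI
For the stepdown SCI, we distinguish the same four cases as in Section 2, now for the one-dimensional tests $\varphi_i^\alpha$ defined above; we refer to them as (9). These cases can also be described by the test statistics, because $\theta_i(\alpha') < g$ is equivalent to the corresponding relation $T_{i,g} < q^t_{\nu,1-\alpha'}$. Figure 3 illustrates the four cases within the dimensions of $T_{1,g}$ and $T_{2,g}$. Writing $q_{\alpha'} = q^t_{\nu,1-\alpha'}$, the resulting expression for the confidence bounds is
$$L_1 = \begin{cases} \theta_1(\alpha/2) & \text{if } T_{1,g} < q_{\alpha/2} \text{ and } T_{2,g} < q_{\alpha/2}, \\ \theta_1(\alpha) & \text{if } T_{2,g} \ge q_{\alpha/2} \text{ and } T_{1,g} < q_{\alpha}, \\ g & \text{if } T_{1,g} \ge q_{\alpha/2} \text{ and } T_{2,g} < q_{\alpha}, \\ \max\{g, \theta_1(\alpha/2)\} & \text{if both } H_1 \text{ and } H_2 \text{ are rejected}, \end{cases}$$
and, respectively, $L_2$ by interchanging the indices 1 and 2.
FIGURE 3 Illustration of the four cases in (9). The set $R$ denotes the set of rejected hypotheses and $q_{\alpha'}$ is the quantile $q^t_{\nu,1-\alpha'}$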

Weighted SCI
We now consider the approach (6), where the individual tests $\varphi(\theta)$ use different weights for different hypotheses $H_{i,\theta_i}$. These weights depend continuously on the parameter $\theta$ via the weight function given in (7). The procedure to obtain the weighted SCI follows the ideas of Brannath and Schmidt. 11 We find a suitable one-dimensional set $M \subseteq \mathbb{R}^2$ and functions $f_i$ (for $i = 1, 2$) that are continuous and increasing in $\theta_i$ on $M$ with $\inf_M f_i \le \alpha \le \sup_M f_i$. Furthermore, we show that the simultaneous confidence bound $L = (L_1, L_2)$ satisfies $L \in M$ and $f_1(L) = f_2(L) = \alpha$. Then, we can construct a simple bisection search to determine the weighted SCI. We explain this approach in more detail in the Supporting Information.
Due to the nonmonotonicity of the test statistics $T_{i,\theta_i}$, we cannot directly apply the approach of Brannath and Schmidt, 11 which supposes monotone p-values. Therefore, we need to restrict the search for the confidence bounds to a two-dimensional interval on which the test statistics are decreasing. Since the derivation of such an interval is rather technical, we defer it to the Appendix. We only mention that the following restriction on the sample sizes is required.
Assumption 2 (Sample sizes). Let $g \in (0, 1)$ be the noninferiority margin for the relative effect. Then, we assume $n_P / n_R \ge (1 - g)/g$. The best strategy to satisfy Assumption 2 is to choose equality here, because for ethical reasons, one does not want more patients in the placebo group than necessary. Actually, Pigeot et al 1 suggested exactly this ratio as the optimal allocation quotient between $n_P$ and $n_R$. Hence, Assumption 2 is not seriously restrictive.

SIMULATION STUDY
We conducted a simulation for a study with two experimental treatments, a reference and a placebo. The R programs for this study are available from the corresponding author upon request. We set the standard deviation of all single observations to 1. The effects of the placebo and reference were 0 and 1, respectively. The noninferiority margin g was set to 0.8. Treatment E 1 was assumed to be as effective as the reference (effect 1), while we varied the effect of treatment E 2 from noninferior to the reference (effect 0.8) to superior to the reference (effect 1.4).
We chose the one-sided significance level $\alpha = 0.025$. Since we have multiple hypotheses, we used $\alpha/2$ for the sample size calculation to obtain sufficient power for the single-step procedure with Bonferroni correction. We calculated the sample size for the experimental treatments according to formula (26) of the work of Pigeot et al 1 such that noninferiority can be shown with probability $1 - \beta = 0.8$ if their actual effect is 1 and the reference effect is 1 as well; in this formula, $z_\gamma$ denotes the $\gamma$-quantile of the standard normal distribution. According to formulas (27) and (28) of the work of Pigeot et al 1 and to ensure that Assumption 2 is satisfied, we further put $n_R = g\,n_{E_1} = 381$ and $n_P = (1 - g)\,n_{E_1} = \frac{1-g}{g}\,n_R = 96$.
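The reported group sizes can be checked with a short calculation. The Python sketch below uses an assumed normal-approximation formula rather than a transcription of formula (26): with the allocation $n_R = g\,n_E$ and $n_P = (1-g)\,n_E$, the variance factor at $\theta = g$ reduces to $2/n_E$, which suggests $n_E = 2(z_{1-\alpha'} + z_{1-\beta})^2\sigma^2 / \{(\theta - g)(\mu_R - \mu_P)\}^2$ with $\alpha' = \alpha/2$. The function and parameter names are hypothetical.

```python
import math
from statistics import NormalDist


def sample_sizes(alpha, power, g, theta=1.0, sigma=1.0, ref_effect=1.0):
    """Group sizes (n_E, n_R, n_P) under the allocation 1 : g : (1 - g).

    Assumed normal-approximation formula; alpha is the one-sided level
    actually used for the test (already Bonferroni-adjusted if needed).
    """
    z = NormalDist().inv_cdf
    # Distance of the true ratio from the margin on the effect scale.
    delta = (theta - g) * ref_effect
    n_e = math.ceil(2.0 * (z(1.0 - alpha) + z(power)) ** 2
                    * sigma ** 2 / delta ** 2)
    n_r = math.ceil(g * n_e)
    n_p = math.ceil((1.0 - g) * n_e)
    return n_e, n_r, n_p


# Inputs from the text: alpha/2 = 0.0125, power 0.8, margin g = 0.8.
n_e, n_r, n_p = sample_sizes(alpha=0.025 / 2, power=0.8, g=0.8)
print(n_e, n_r, n_p)
```

With the inputs from the text, this sketch gives 381 patients for the reference and 96 for the placebo group, in line with the values reported above, and 476 patients per experimental group.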
These sample sizes led to an effective power of 1 to show superiority of the reference over placebo, ie, Assumption 1 was satisfied in all 10 000 simulation runs. We consider the three procedures with Bonferroni correction that we have described in the previous section (single-step, stepdown, and weighted procedure with exponential weight function). Dilba et al 6 have proposed alternative single-step procedures involving the multivariate distribution of $T_1$ and $T_2$, where the unknown correlation is handled in different ways. The resulting procedure works like the single-step test with Bonferroni correction, but with quantiles other than $q^t_{\nu,1-\alpha/2}$ as critical points. The "MtI" approach uses the quantiles of the multivariate t-distribution, where the identity matrix is taken as correlation matrix. The "plug-in" method replaces the unknown correlation between $T_1$ and $T_2$ by an estimate obtained from the data. It guarantees the desired coverage probability only asymptotically. For comparison, we also show results for these two alternatives of the single-step procedure and for a corresponding adaptation of the stepdown procedure. Table 1 shows the simulation results for the case where both experimental treatments are as effective as the reference.
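To give a feeling for the power comparison, the following Python sketch simulates the rejection of the noninferiority hypothesis for $E_1$ with the Bonferroni single-step test and with the Bonferroni-Holm stepdown test. It is an illustration under stated assumptions and not the authors' R program: group means and the pooled variance are drawn directly from their sampling distributions, normal quantiles stand in for t-quantiles (reasonable here because $\nu$ is large), the size 476 for the experimental groups is derived from $n_R = g\,n_{E_1} = 381$, and Assumption 1 is not checked. All names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def simulate_power(n_runs=2000, g=0.8, alpha=0.025, seed=1,
                   mu=(1.0, 1.0, 1.0, 0.0), n=(476, 476, 381, 96)):
    """Estimate the power to show noninferiority of E1.

    Groups are ordered (E1, E2, R, P); all observations have sigma = 1.
    Returns (power of single-step test, power of stepdown test).
    """
    rng = random.Random(seed)
    nu = sum(n) - 4
    # Normal quantiles as a stand-in for t-quantiles with nu d.f.
    q_half = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    q_full = NormalDist().inv_cdf(1.0 - alpha)
    hits_single = hits_stepdown = 0
    for _ in range(n_runs):
        # Group means are N(mu, 1/n); pooled variance is chi^2_nu / nu.
        means = [rng.gauss(m, 1.0 / math.sqrt(k)) for m, k in zip(mu, n)]
        s = math.sqrt(rng.gammavariate(nu / 2.0, 2.0) / nu)
        x_r, x_p = means[2], means[3]
        t = []
        for i in range(2):
            v = 1.0 / n[i] + g ** 2 / n[2] + (1.0 - g) ** 2 / n[3]
            num = means[i] - g * x_r - (1.0 - g) * x_p
            t.append(num / (s * math.sqrt(v)))
        # Single-step: H_1 rejected at level alpha/2.
        single = t[0] >= q_half
        # Holm: additionally rejected at level alpha once H_2 fell at alpha/2.
        stepdown = single or (t[1] >= q_half and t[0] >= q_full)
        hits_single += single
        hits_stepdown += stepdown
    return hits_single / n_runs, hits_stepdown / n_runs


power_single, power_stepdown = simulate_power()
print(power_single, power_stepdown)
```

By construction, the stepdown test rejects whenever the single-step test does, so its estimated power can never be smaller; the sketch says nothing about how informative the resulting confidence bounds are.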

Results for treatment $E_1$ if both experimental treatments are as effective as the reference
Results for $E_1$ are shown; the results for $E_2$ are, of course, very similar. Coverage is good for all procedures and, as expected, slightly smaller for the plug-in approaches. The stepdown procedure can show noninferiority in more cases than the other procedures; however, the confidence bound then often equals exactly the noninferiority margin $g$. Hence, the probability for what we call informative noninferiority, ie, a confidence bound indicating a positive effect above the noninferiority margin, is even lower than for the single-step procedures. The weighted approach does not have this problem and exhibits a marginal power gain compared to the single-step procedure. The median confidence bounds (and also the means) are quite similar for all methods.

Results for treatments $E_1$ and $E_2$ in dependence of the effect of $E_2$
Tables 2 and 3 show the probabilities to prove noninferiority, informative noninferiority, and superiority for the second treatment, where the true effect $\theta_2$ of $E_2$ is varying. The weighted method is compared to the Bonferroni-corrected single-step and stepdown procedures. Only the latter has different results for noninferiority and informative noninferiority. While the power to show noninferiority is the largest for the stepdown approach for all values of $\theta_2$, its probability to obtain informative noninferiority is much smaller. Accordingly, superiority can be shown in fewer cases, even if the effect of $E_2$ is very large. The weighted intervals yield uniformly larger probabilities to obtain informative noninferiority than the single-step and the stepdown intervals. The small power gain over the single-step approach comes at the price of a slightly smaller probability to show superiority. Figure 4 shows how the power to show noninferiority of $E_1$ and the mean confidence bound $L_1 = L_{E_1-R}$ develop in dependence of $\theta_2$. Although the stepdown procedure is superior in power, its mean bound is very small if $\theta_2$ is small. The weighted procedure is slightly superior to the single-step procedure in power and has a higher mean bound if $\theta_2$ is not too small.
To summarize, in practical application where informative SCIs are important, the power gains of the stepdown procedure are in conflict with a high risk of confidence bounds that are equal to the noninferiority margin, even if the treatment is superior to the reference. If the only goal of the study is to obtain noninferiority for both experimental treatments, then the stepdown procedure is a good choice. Otherwise, the weighted approach is a good alternative, because it is more stable and always informative, with slightly larger power than the single-step approach.

DISCUSSION AND EXTENSIONS
The gold standard design is very relevant for practical applications, and the question of multiplicity adjustment arises naturally if more than one experimental treatment is involved. In this paper, we gave a first insight into how SCIs can be constructed in this case. We extended the existing single-step procedure for ratios, first, to a more powerful method in the spirit of Strassburger and Bretz 10 and, second, to a more powerful and informative method in the spirit of Brannath and Schmidt. 11 As a theoretical novelty, we have extended the Fieller confidence intervals for multiple ratios, as proposed in the work of Dilba et al, 6 in a very general manner to a stepdown procedure. Since the corresponding SCIs of Strassburger and Bretz 10 are often noninformative, which is rarely acceptable in noninferiority trials, we have investigated an alternative construction, which is based on a more flexible weight function. The new approaches were analyzed for the gold standard design with two experimental treatments. As the following remark indicates, a generalization to more than two experimental treatments is straightforward.
Remark 2. Assume that we have $m$ experimental treatments and thus need to construct an SCI for the $m$-dimensional parameter $\theta = (\theta_1, \ldots, \theta_m)$ with $\theta_i$ as in (1). If we replace the quantile $q^t_{\nu,1-\alpha/2}$ in Assumption 1 by $q^t_{\nu,1-\alpha/m}$, then all three procedures considered here can easily be applied to this general case.
The single-step interval construction uses for $i = 1, \ldots, m$ the one-dimensional intervals $C_i = (L_i, \infty)$, where $L_i = \theta_i(\alpha')$ with $\alpha' = \alpha/m$ is the solution $\theta_i$ of the equation $T_{i,\theta_i} = q^t_{\nu,1-\alpha'}$. The stepdown interval construction applies the partitioning principle to the Bonferroni-Holm test for the hypotheses $H_i: \theta_i \le g$, $i = 1, \ldots, m$. This leads to the generalization of (9), which is given by formula (10) of the work of Strassburger and Bretz. 10 More precisely,
$$L_i = \begin{cases} g & \text{if } i \in R \text{ and } R \ne \{1, \ldots, m\}, \\ \theta_i\bigl(\alpha/(m - |R|)\bigr) & \text{if } i \notin R, \\ \max\{g, \theta_i(\alpha/m)\} & \text{if } R = \{1, \ldots, m\}, \end{cases}$$
where $R$ is the set of rejected hypotheses. The weighted interval construction generalizes the weight function (7) as follows. Let $E(x)$ be any continuous and decreasing function on $\mathbb{R}$ with $E(x) = 1$ for $x \le g$ and $E(x) \to 0$ as $x \to \infty$. The weights $w_i(\theta)$, $i = 1, \ldots, m$, are then built from $E$ in analogy to (7). Each $\theta \in \mathbb{R}^m$ is rejected if and only if any of the $m$ individual hypotheses $H_{i,\theta_i}$ is rejected at level $w_i(\theta)\alpha$. The weighted SCI is the smallest rectangle containing all $\theta$ that are not rejected. A bisection search to explicitly find the confidence bounds is derived with the same ideas as explained above and in the Appendix. Since the intervals where the test statistics are monotone are constructed individually for each $i = 1, \ldots, m$, the generalization of this approach is straightforward.
We have seen that the stepdown procedure considerably increases the power to reject the noninferiority hypotheses, but the confidence bound itself is often uninformative because it gives no further information on the size of the effects. The weighted procedure performs well and shows a marginal improvement over the single-step approach in our application. The weight function (7) has a very general form and depends on the choice of the continuous decreasing function $E(x)$. In this paper, we work with an exponential function. We also ran simulations with different choices of $E(x)$; however, the results were substantially the same.
Further applications and scenarios could be considered to make full use of the theoretical strength of the weighted approach. In addition, this procedure might be extended by incorporating the joint distribution of the test statistics, as for the MtI and plug-in options in the single-step and stepdown case. An extension to settings with more than one reference is possible as well.

ACKNOWLEDGEMENT
This research was supported by the DFG grant BR 3737/1-1.

DATA AVAILABILITY STATEMENT
The R programs for the simulation study are available from the corresponding author upon request.