Beyond the Two-Trials Rule

The two-trials rule for drug approval requires "at least two adequate and well-controlled studies, each convincing on its own, to establish effectiveness". This is usually implemented by requiring two significant pivotal trials and is the standard regulatory requirement to provide evidence for a new drug's efficacy. However, there is a need to develop suitable alternatives to this rule for a number of reasons, including the possible availability of data from more than two trials. I consider the case of up to 3 studies and stress the importance of controlling the partial Type-I error rate, where only some studies have a true null effect, while maintaining the overall Type-I error rate of the two-trials rule, where all studies have a null effect. Some less-known $p$-value combination methods are useful to achieve this: Pearson's method, Edgington's method and the recently proposed harmonic mean $\chi^2$-test. I study their properties and discuss how they can be extended to a sequential assessment of success while still ensuring overall Type-I error control. I compare the different methods in terms of partial Type-I error rate, project power and the expected number of studies required. Edgington's method is eventually recommended as it is easy to implement and communicate, and has only moderate partial Type-I error rate inflation but substantially increased project power.


Introduction
The two-trials rule is a standard requirement by the FDA to demonstrate efficacy of drugs. It demands "at least two pivotal studies, each convincing on its own" before drug approval is granted. The rule is usually implemented by requiring two independent studies to be significant at the standard (one-sided) $\alpha = 0.025$ level [1, Sec. 12.2.8]. Statistical justification for the two-trials rule is usually based on a hypothesis testing perspective where Type-I error (T1E) control is the primary goal. The T1E rate is the probability of a false claim of success under a certain null hypothesis. Two different null hypotheses are relevant if results from two studies are available. The intersection null hypothesis

$$H_0^1 \cap H_0^2\colon \theta_1 = \theta_2 = 0 \tag{1}$$

is a point null hypothesis, defined as the intersection of the study-specific null hypotheses $H_0^i\colon \theta_i = 0$, $i = 1, 2$, where $\theta_i$ denotes the true effect size in the $i$-th study. The probability of a false claim of success with respect to the intersection null (1) is the overall or project-wise T1E rate. The overall T1E rate of the two-trials rule is $\alpha^2 = 0.025^2 = 0.000625$, because the two studies are assumed to be independent.
The no-replicability or union null hypothesis is defined as the complement of the alternative hypothesis that the effect is non-null in both studies [2]. This is a composite null hypothesis, which also includes the possibility that only one of the studies has a null effect:

$$H_0^1 \cup H_0^2\colon \theta_1 = 0 \text{ or } \theta_2 = 0. \tag{2}$$

I follow Micheloud et al. [3] and call the probability of a false claim of success with respect to the union null (2) the partial T1E rate. The partial T1E rate depends on the values of $\theta_1$ and $\theta_2$, where one of the two parameters must be zero but the other one may not be zero.
The partial T1E rate of the two-trials rule is bounded from above by $\alpha$; the exact value depends on the difference $\theta_1 - \theta_2$ [4].
In principle there are now two options to develop alternatives to the two-trials rule, see Figure 1 for an illustration with Edgington's method. First, we may consider methods with partial T1E rate bound at $\alpha$, but this will inevitably reduce the overall T1E rate and the project power, because the success region of any such method (in terms of the trial-specific p-values $p_1$ and $p_2$) must be a subset of the success region of the two-trials rule.
Alternatively we may fix the overall T1E rate at $\alpha^2$ and allow for some inflation of the partial T1E rate. Now the success region is no longer a subset of the success region of the two-trials rule, so the impact on project power is not immediate. The latter approach has also been selected by Rosenkranz [5] as an "overarching principle" in the search for generalizations of the two-trials rule. Our goal is thus to allow for some (limited) inflation of the partial T1E rate while maintaining overall T1E control. I follow the definition of the Fisherian T1E probability [6], where p-values represent quantitative measures of the evidence against a null hypothesis. I propose to base inference on p-value combination methods which return a combined p-value that can be interpreted as a quantitative measure of the total available evidence. However, many standard methods control the partial T1E rate only at the trivial bound 1, for example Fisher's or Tippett's method [7]. Also Stouffer's "inverse normal" method, which is closely related to a fixed effect meta-analysis and called the "pooled-trials rule" in Senn [1], does not control the partial T1E rate at a non-trivial bound. Indeed, all these methods can flag success even if one of the two studies is completely unconvincing, with an effect estimate perhaps even significant in the wrong direction. This may be acceptable within a study to be able to stop an experiment at interim for efficacy [8], but is unacceptable to assess replicability across studies [9]. I therefore concentrate on less-known p-value combination methods that have an in-built non-trivial control of the partial T1E rate: the n-trials rule, Pearson's and Edgington's method, the harmonic mean $\chi^2$-test and the sceptical p-value.
The outcome of this paper is a tentative recommendation for Edgington's p-value combination method [10] where stopping for success after two studies is possible, otherwise a third study will be required. All these approaches control the overall T1E rate at $\alpha^2 = 0.025^2$ while ensuring that all studies considered are "convincing on their own" to a sufficient degree, i.e. they have a non-trivial and sufficiently small bound on the partial T1E rate. Furthermore, they have larger power to detect existing effects than the two- respectively three-trials rule and other attractive properties.

Box 1 Edgington's method: An alternative to the two-trials rule
Error control: Overall T1E control at level 0.025

Section 2 describes p-value combination methods with a non-trivial bound on the partial T1E rate. Section 3 compares these methods for data from two respectively three trials. For three trials we also discuss the 2-of-3 method recently proposed by Rosenkranz [5]. Section 4 develops sequential versions of some of the methods, which allow stopping early for success after two trials. A comparison in terms of project power and expected number of studies required is presented. I close with some discussion in Section 5.

P-value combination methods with partial T1E control
In the following I will describe p-value combination methods that control the partial T1E rate at a non-trivial bound different from 1. Throughout I will work with one-sided p-values $p_1, \ldots, p_n$ and assume that the $p_i$'s are independent and uniformly distributed under the null hypothesis $H_0^i\colon \theta_i = 0$, $i = 1, \ldots, n$. Overall and partial T1E rate are now defined accordingly across all $n$ trials. Throughout I aim to achieve an overall T1E rate of $\alpha_\cap$, usually $\alpha_\cap = 0.025^2 = 0.000625$.

The n-trials rule
The two-trials rule can easily be generalized to $n$ trials. For example, $n = 3$ independent trials need to achieve significance at the adjusted level $\alpha^{2/3}$ to control the overall T1E rate at level $\alpha_\cap = (\alpha^{2/3})^3 = \alpha^2$. For $\alpha = 0.025$ the adjusted level is $0.025^{2/3} \approx 0.085$, so the partial T1E rate bound of the three-trials rule at overall level $\alpha_\cap = 0.025^2$ is also 0.085. The general threshold is $\alpha^{2/n}$, which serves as a benchmark for other methods based on $n$ trials with overall T1E rate $\alpha_\cap = \alpha^2$. The combined p-value of the n-trials rule is $p = \max\{p_1, p_2, \ldots, p_n\}^n$.
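The threshold and the combined p-value of the n-trials rule can be checked with a few lines of Python. This is a minimal sketch; the function names are hypothetical, and the three p-values are the ones used later in the discussion of the 2-of-3 rule.

```python
def n_trials_threshold(alpha: float, n: int) -> float:
    """Per-trial level alpha**(2/n), so that n independent trials
    give an overall T1E rate of alpha**2."""
    return alpha ** (2.0 / n)


def n_trials_combined_p(p_values) -> float:
    """Combined p-value of the n-trials rule: max(p_1, ..., p_n) ** n."""
    return max(p_values) ** len(p_values)


alpha = 0.025
threshold_3 = n_trials_threshold(alpha, 3)          # roughly 0.085
combined = n_trials_combined_p([0.02, 0.02, 0.001])  # 0.02 ** 3
success = combined <= alpha ** 2
print(round(threshold_3, 4), combined, success)
```

Thresholding the combined p-value at $\alpha^2$ is equivalent to requiring every single p-value to be at most $\alpha^{2/n}$.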

Pearson's combination test
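Pearson's combination method [11,12] is a less-known variation of Fisher's method.
Fisher's method [13] is based on the test statistic

$$F_n = -2 \sum_{i=1}^{n} \log(p_i),$$

which follows a $\chi^2_{2n}$ distribution if the p-values $p_1, \ldots, p_n$ are independent and uniformly distributed. Large values of $F_n$ provide evidence against the intersection null, and thresholding $F_n$ at the $(1 - \alpha_\cap)$-quantile $\chi^2_{2n}(1 - \alpha_\cap)$ of the $\chi^2_{2n}$ distribution gives the success criterion

$$\prod_{i=1}^{n} p_i \le \exp\left\{-\tfrac{1}{2}\, \chi^2_{2n}(1 - \alpha_\cap)\right\}. \tag{4}$$

It follows that a sufficient (but not necessary) criterion for success is that at least one p-value fulfills

$$p_i \le \exp\left\{-\tfrac{1}{2}\, \chi^2_{2n}(1 - \alpha_\cap)\right\}. \tag{5}$$

For example, for $n = 2$ and $\alpha_\cap = 0.025^2 = 0.000625$, the right-hand side of (5) is 0.00006.
If the first p-value $p_1$ is smaller than this bound, Fisher's criterion (4) will flag success no matter what the result from the second study is. Fisher's method therefore does not control the partial T1E rate at a non-trivial bound.
Pearson's method uses the fact that if $p_i$ is uniform also $1 - p_i$ is uniform, so

$$K_n = -2 \sum_{i=1}^{n} \log(1 - p_i)$$

also follows a $\chi^2_{2n}$ distribution under the intersection null. Small values of $K_n$ provide evidence against the intersection null, and success is flagged if

$$K_n \le a_n = \chi^2_{2n}(\alpha_\cap), \tag{7}$$

with corresponding combined p-value $p = \Pr(\chi^2_{2n} \le K_n)$. It follows from (7) that a necessary (but not sufficient) success criterion is that all p-values fulfill

$$p_i \le 1 - \exp(-a_n/2). \tag{8}$$

For $n = 2$ and $\alpha_\cap = 0.025^2$ the right-hand side of (8) is 0.035, which is then also the bound of Pearson's method on the partial T1E rate and only slightly larger than for the two-trials rule where both p-values have to be smaller than 0.025.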
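The two bounds quoted above (0.00006 for Fisher, 0.035 for Pearson) can be reproduced numerically. The sketch below assumes the standard forms of the two statistics, $-2\sum\log p_i$ and $-2\sum\log(1-p_i)$, and uses the closed-form distribution function of a $\chi^2$ distribution with an even number of degrees of freedom, so only the standard library is needed. All function names are hypothetical.

```python
import math


def chi2_cdf_even(x: float, n: int) -> float:
    """CDF of a chi-squared distribution with 2n degrees of freedom
    (closed Erlang form, valid because the df are even)."""
    if x <= 0:
        return 0.0
    y = x / 2.0
    tail = math.exp(-y) * sum(y ** k / math.factorial(k) for k in range(n))
    return 1.0 - tail


def chi2_quantile_even(prob: float, n: int) -> float:
    """Quantile of chi-squared with 2n df, found by bisection on the CDF."""
    lo, hi = 0.0, 200.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_cdf_even(mid, n) < prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def fisher_p(p_values) -> float:
    """Fisher's combined p-value: upper tail of -2 * sum(log p_i)."""
    stat = -2.0 * sum(math.log(p) for p in p_values)
    return 1.0 - chi2_cdf_even(stat, len(p_values))


def pearson_p(p_values) -> float:
    """Pearson's combined p-value: lower tail of -2 * sum(log(1 - p_i))."""
    stat = -2.0 * sum(math.log(1.0 - p) for p in p_values)
    return chi2_cdf_even(stat, len(p_values))


alpha_cap = 0.025 ** 2
# Sufficient bound for Fisher: a single p-value this small always gives success.
fisher_bound = math.exp(-chi2_quantile_even(1.0 - alpha_cap, 2) / 2.0)
# Necessary bound for Pearson: every p-value must be below this value.
pearson_bound = 1.0 - math.exp(-chi2_quantile_even(alpha_cap, 2) / 2.0)
print(fisher_bound, pearson_bound)
```

With the illustrative pair $p_1 = 0.00005$, $p_2 = 0.9$, `fisher_p` falls below $\alpha_\cap$ although the second study is unconvincing, whereas `pearson_p` does not.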

Edgington's method
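Edgington [10] proposed a method that combines p-values by addition rather than multiplication as in Fisher's criterion (4). Edgington's test statistic is

$$E_n = p_1 + \ldots + p_n. \tag{9}$$

Under the intersection null hypothesis, the sum of the p-values follows the Irwin-Hall distribution [14,15], denoted as $E_n \sim \mathrm{IH}(n)$. Success is flagged if $E_n \le b_n$, where the critical value $b_n$ is the $\alpha_\cap$-quantile of the $\mathrm{IH}(n)$ distribution. For $\alpha_\cap \le 1/n!$ it has the closed form $b_n = (n!\, \alpha_\cap)^{1/n}$. Every p-value must then fulfill $p_i \le b_n$, so $b_n$ is also the bound on the partial T1E rate. Pearson's method is closely related: with the approximation $\log(1 - p_i) \approx -p_i$, Pearson's success criterion becomes

$$\sum_{i=1}^{n} p_i \lesssim \tfrac{1}{2}\, \chi^2_{2n}(\alpha_\cap), \tag{11}$$

where $\chi^2_{2n}(\alpha_\cap)$ denotes the $\alpha_\cap$-quantile of the $\chi^2_{2n}$ distribution and the left-hand side is Edgington's test statistic (9). For $n = 2$ and $\alpha_\cap = 0.025^2$ the right-hand side of (11) is 0.036, nearly identical to the critical value $b_2 = 0.035$ based on Edgington's method.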
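Edgington's combined p-value and critical value are simple enough to compute by hand or in a few lines of code. The sketch below assumes the standard Irwin-Hall distribution function and the closed form $(n!\,\alpha_\cap)^{1/n}$ for the critical value, which holds as long as the result does not exceed 1. The function names are hypothetical.

```python
import math


def irwin_hall_cdf(x: float, n: int) -> float:
    """CDF of the sum of n independent U(0, 1) random variables."""
    if x <= 0:
        return 0.0
    if x >= n:
        return 1.0
    total = sum(
        (-1) ** k * math.comb(n, k) * (x - k) ** n
        for k in range(int(math.floor(x)) + 1)
    )
    return total / math.factorial(n)


def edgington_p(p_values) -> float:
    """Edgington's combined p-value: Irwin-Hall CDF at the sum of p-values."""
    return irwin_hall_cdf(sum(p_values), len(p_values))


def edgington_critical_value(alpha_cap: float, n: int) -> float:
    """Critical value for the sum of p-values.
    Assumes alpha_cap <= 1/n!, so that the critical value is at most 1."""
    return (math.factorial(n) * alpha_cap) ** (1.0 / n)


alpha_cap = 0.025 ** 2
b2 = edgington_critical_value(alpha_cap, 2)   # sqrt(2) * 0.025
b3 = edgington_critical_value(alpha_cap, 3)
# Illustrative p-values: p_1 + p_2 = 0.035 is just within the critical value.
p_combined = edgington_p([0.02, 0.015])
print(b2, b3, p_combined, p_combined <= alpha_cap)
```

For two trials the rule reduces to checking whether $p_1 + p_2 \le \sqrt{2}\,\alpha$, which is what makes the method easy to communicate.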

Held's method
The harmonic mean $\chi^2$-test [17], in the following abbreviated as Held's method, is a recently proposed p-value combination method that also controls the partial T1E rate at a non-trivial bound. Consider the inverse normal transformation

$$Z_i = \Phi^{-1}(1 - p_i), \quad i = 1, \ldots, n.$$

Under the intersection null hypothesis, all p-values are uniform and the corresponding Z-values are therefore independent standard normal. The harmonic mean $\chi^2$-test statistic

$$X_n^2 = \frac{n^2}{\sum_{i=1}^{n} 1/Z_i^2}$$

then follows a $\chi^2_1$ distribution. We are interested in one-sided alternatives where all effect estimates are positive, say, and flag success if all $Z_i > 0$ and $X_n^2 \ge d_n$. The critical value $d_n$ is chosen such that $\Pr(\chi^2_1 \ge d_n) = 2^n \alpha_\cap$ and thus depends on the overall T1E rate $\alpha_\cap$ and the value of $n$ [17, Section 2]. The requirement $X_n^2 \ge d_n$ is equivalent to

$$\sum_{i=1}^{n} 1/Z_i^2 \le c_n = n^2/d_n. \tag{13}$$

From (13) we see that $1/Z_i^2 \le c_n$ must hold for all $i = 1, \ldots, n$ to achieve success. This can be re-written as a necessary (but not sufficient) success condition on the p-value from the $i$-th trial:

$$p_i \le 1 - \Phi(1/\sqrt{c_n}). \tag{14}$$

The right-hand side of (14) thus represents a bound on the partial T1E rate. For $\alpha_\cap = 0.025^2$ and $n = 2$ studies we obtain $c_2 = 0.44$ and the value $1 - \Phi(1/\sqrt{0.44}) = 0.065$. This shows that Held's method, applied to two studies, controls the partial T1E rate, but at a larger bound than Pearson's or Edgington's method.
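The numbers in this section can be reproduced with the standard library. The sketch below assumes that the critical value $d_n$ solves $\Pr(\chi^2_1 \ge d_n) = 2^n \alpha_\cap$, which is an assumption about the one-sided calibration in [17] that reproduces the values 0.44 and 0.065 quoted in the text. Function names are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def held_budget(alpha_cap: float, n: int) -> float:
    """c_n = n^2 / d_n, with d_n the chi-squared(1) critical value.
    Assumption: Pr(chi2_1 >= d_n) = 2^n * alpha_cap (one-sided calibration)."""
    z_crit = STD_NORMAL.inv_cdf(1.0 - 2 ** (n - 1) * alpha_cap)
    return n ** 2 / z_crit ** 2


def held_partial_bound(alpha_cap: float, n: int) -> float:
    """Necessary condition on each p-value: p_i <= 1 - Phi(1 / sqrt(c_n))."""
    return 1.0 - STD_NORMAL.cdf(1.0 / math.sqrt(held_budget(alpha_cap, n)))


def held_success(p_values, alpha_cap: float) -> bool:
    """Success if all effects point in the right direction and
    sum(1 / Z_i^2) stays within the budget c_n."""
    if any(p >= 0.5 for p in p_values):  # Z_i <= 0: wrong direction
        return False
    z = [STD_NORMAL.inv_cdf(1.0 - p) for p in p_values]
    return sum(1.0 / zi ** 2 for zi in z) <= held_budget(alpha_cap, len(z))


alpha_cap = 0.025 ** 2
c2 = held_budget(alpha_cap, 2)
bound2 = held_partial_bound(alpha_cap, 2)
bound3 = held_partial_bound(alpha_cap, 3)
print(c2, bound2, bound3)
print(held_success([0.003, 0.03], alpha_cap))   # one p-value above 0.025
print(held_success([0.0001, 0.07], alpha_cap))  # 0.07 exceeds the bound
```

The first illustrative pair would fail the two-trials rule because $p_2 > 0.025$, but it stays within Held's budget; the second pair fails because one p-value exceeds the partial T1E bound.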

Sceptical p-value
The sceptical p-value [18] has been developed for the joint analysis of an original and a replication study, so is restricted to $n = 2$ studies. Like the harmonic mean $\chi^2$-test, it depends on the squared z-statistics $Z_1^2$ and $Z_2^2$, but also takes into account the variance ratio $c = \sigma_1^2/\sigma_2^2$, the ratio of the variances $\sigma_1^2$ and $\sigma_2^2$ of the effect estimates (original to replication).
In its original formulation the sceptical p-value $p_S$ is always larger than the two study-specific p-values, so controls the partial T1E rate at $\alpha$ if $p_S \le \alpha$ defines replication success.
Its overall T1E rate is considerably smaller than $\alpha^2$ and depends on $c$. Micheloud et al. [3] have recently developed a recalibration that enables exact overall T1E rate control at $\alpha^2$, for any value of $c$. A consequence of this recalibration is that the bound $\gamma$ on the partial T1E rate is now larger than $\alpha$ and increases with increasing $c$. The limiting case $c \to 0$ corresponds to the two-trials rule. The harmonic mean $\chi^2$-test for $n = 2$ studies turns out to be another special case for $c = 1$.
Here we propose to use the sceptical p-value as a flexible combination method of p-values from two independent trials. The parameter $c$ is then free to choose and not related to the variance ratio. It can be selected to achieve a desired bound $\gamma$ on the partial T1E rate while maintaining the overall T1E rate at $\alpha^2$, see the inset plot of Figure 2. For example, we might consider a bound of $\gamma = 2\alpha = 0.05$ on the partial T1E rate as acceptable [9], in which case $c = 0.43$. The more stringent bound $\gamma = 1.2\,\alpha = 0.03$ is obtained with $c = 0.04$.

Simulation study
I now report the success probability of the different methods under different scenarios.
It is well known [19] that under the alternative that was used to power the two trials, the distribution of $Z_1$ and $Z_2$ is $N(\mu_i, 1)$, where

$$\mu_i = \Phi^{-1}(1 - \alpha) + \Phi^{-1}(1 - \beta_i),$$

$\alpha$ is the significance threshold and $1 - \beta_i$ the power of each trial. The case $\mu_i = 0$ (where $1 - \beta_i = \alpha$) represents the null hypothesis $H_0^i\colon \theta_i = 0$. We can thus simulate independent $Z_1$ and $Z_2$ for $\alpha = 0.025$ and different values of the individual trial power $1 - \beta_i \in \{0.9, 0.8, 0.6\}$ and compute the project power, the proportion of results with drug approval at overall T1E rate $\alpha^2 = 0.025^2$. The simulation is based on $10^6$ samples, so the Monte Carlo standard error is smaller than 0.05 on the percentage scale. The results shown in Table 1 are in the expected order with increasing project power for increasing bound on the partial T1E rate. The increase in project power is quite substantial compared to the two-trials rule. Already Edgington's method has an increase between 3 and 5 percentage points, and the sceptical p-value (with $c = 0.43$) increases the project power by 5 to 7 percentage points. Held's method has an even larger increase, between 6 and 8 percentage points.
We can also investigate the probability of success if one of the true treatment effects is zero (say $\mu_1 = 0$), but the other one is not. This probability cannot be larger than the bound on the partial T1E rate of the respective method, and it is interesting to see how much smaller it is for different values of the power to detect the effect in the second study. The results in Table 2 show that the probability of a false positive claim has the same ordering as the project power shown in Table 1. The probabilities are fairly small, for example between 1.9 and 3.4% for the sceptical p-value $p_S$ with $c = 0.43$ and upper bound 5%. Even Held's method with a relatively large bound of 6.5% has a partial T1E rate of only 3.8% if the power of the second study is as high as 90%.
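A small Monte Carlo sketch along the lines of this simulation study is given below, restricted to the two-trials rule and Edgington's method. It is not the code behind Tables 1 and 2: the number of samples is reduced to $10^5$ to keep the pure-Python run time short, the seed is arbitrary, the function names are hypothetical, and the Edgington critical value $\sqrt{2}\,\alpha$ is assumed.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def trial_mean(power: float, alpha: float = 0.025) -> float:
    """Mean of the z-statistic of a trial with the given power at level alpha.
    A power equal to alpha corresponds to the null (mean zero)."""
    return STD_NORMAL.inv_cdf(1.0 - alpha) + STD_NORMAL.inv_cdf(power)


def simulate_success(power1, power2, alpha=0.025, n_sim=100_000, seed=1):
    """Success rates of the two-trials rule and of Edgington's method
    for two independent trials, estimated from the same simulated data."""
    mu1, mu2 = trial_mean(power1, alpha), trial_mean(power2, alpha)
    b2 = math.sqrt(2.0) * alpha  # assumed Edgington critical value for n = 2
    rng = random.Random(seed)
    two_trials = edgington = 0
    for _ in range(n_sim):
        p1 = 1.0 - STD_NORMAL.cdf(rng.gauss(mu1, 1.0))
        p2 = 1.0 - STD_NORMAL.cdf(rng.gauss(mu2, 1.0))
        two_trials += (p1 <= alpha and p2 <= alpha)
        edgington += (p1 + p2 <= b2)
    return two_trials / n_sim, edgington / n_sim


# Project power: both trials powered at 90%.
power_two_trials, power_edgington = simulate_success(0.9, 0.9)
# Partial T1E rate: first trial from the null, second powered at 90%.
partial_two_trials, partial_edgington = simulate_success(0.025, 0.9)
print(power_two_trials, power_edgington)
print(partial_two_trials, partial_edgington)
```

For the two-trials rule the exact values are available as a sanity check: the project power is the product of the two individual powers, and the partial T1E rate is $\alpha$ times the power of the second trial.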

The 2-of-3 rule
Additional issues arise in the application of the two-trials rule if more than two studies are conducted [20]. Requiring two out of $n$ studies to be significant at level $\alpha$ inflates the overall T1E rate beyond $\alpha^2$ and so adjustments are needed. Rosenkranz [5] introduces the k-of-n rule [see also earlier work in 21,22] and argues that "the [overall] type-I error rate of any procedure involving more than two trials shall equal the [overall] type-I error rate from the two trials rule." Here I consider the "2-of-3 rule" where $n = 3$ studies are conducted, $k = 2$ of which need to be significant to flag success. Rosenkranz [5] shows that the corresponding significance level has to be reduced to $\alpha_{2|3} = 0.015$ to ensure that the overall T1E rate is $\alpha^2 = 0.025^2$. However, the 2-of-3 rule no longer controls the partial T1E rate, because one of the three studies can be completely unconvincing, perhaps even with an effect size estimate in the wrong direction. This is a major drawback that may be considered as unacceptable by regulators.
Furthermore, thresholding individual studies for significance without taking into account the results from the other studies creates problems in interpretation. For example, suppose that the first two studies have p-values $p_1 = p_2 = 0.02$ while the third study has not yet been started. The 2-of-3 rule would then stop the project for failure, even though the standard two-trials rule (falsely assuming only two studies have been planned from the start) would flag success. Now suppose results from the third trial are also available, with $p_3 = 0.001$, say. The 2-of-3 rule would then still conclude project failure although the combined evidence for an existing treatment effect will be overwhelming. For example, the three-trials rule would flag a clear success as all three p-values are smaller than 0.085 and the combined p-value $p = \max\{p_1, p_2, p_3\}^3 = 0.02^3 = 0.000008$ is much smaller than $0.025^2 = 0.000625$. These considerations illustrate that individual thresholding of study-specific p-values may lead to paradoxes that can be hard to explain to practitioners.
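The adjusted level of the 2-of-3 rule and the paradox described above can be checked numerically. The sketch below assumes three independent trials tested at a common level, so that the overall T1E rate is the binomial probability of at least two significant results, and solves for the level by bisection. The function names are hypothetical.

```python
def two_of_three_overall_t1e(level: float) -> float:
    """Probability that at least 2 of 3 independent null trials
    are significant at the given one-sided level."""
    return 3.0 * level ** 2 * (1.0 - level) + level ** 3


def two_of_three_level(alpha_cap: float) -> float:
    """Level at which the 2-of-3 rule has overall T1E rate alpha_cap
    (bisection; the overall rate is increasing in the level)."""
    lo, hi = 0.0, 0.5
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if two_of_three_overall_t1e(mid) < alpha_cap:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


alpha_cap = 0.025 ** 2
level_2of3 = two_of_three_level(alpha_cap)   # about 0.015

p_values = [0.02, 0.02, 0.001]               # example from the text
n_significant = sum(p <= level_2of3 for p in p_values)
two_of_three_success = n_significant >= 2
three_trials_success = max(p_values) ** 3 <= alpha_cap
print(level_2of3, two_of_three_success, three_trials_success)
```

Only the third p-value is below the adjusted level, so the 2-of-3 rule fails while the three-trials rule succeeds.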

Other methods
I now investigate the applicability of the p-value combination methods described in Section 2 to data from three trials. The sceptical p-value is not available for 3 (or more) studies, so we restrict attention to the remaining methods. Application of Held's method to $n = 3$ studies requires every study-specific p-value $p_i$ to be smaller than 0.17. This is only slightly larger than the bounds 0.15 and 0.16 with Pearson's respectively Edgington's method. These values need to be compared not to 0.025, but to 0.085, the adjusted significance threshold of the three-trials rule. The increase in partial T1E rate of Pearson's and Edgington's method is therefore not more than twofold, while Held's method has an increase by a factor of 2.04.

Simulation study
I have conducted another simulation study with $n = 3$ trials, powered at different values, now with significance level 0.085. The project power listed in Table 3 shows very similar values for Pearson's, Edgington's and Held's method. Those values are considerably larger (nearly 10 percentage points) than for the three-trials rule. The 2-of-3 rule is also listed and has a smaller increase in project power of around 3-6 percentage points.
Turning to the partial T1E rates shown in Table 4 we see values close to 50% for the 2-of-3 rule if one of the trials comes from the null and the other two have power 90% to detect an existing effect. This illustrates the lack of control of the partial T1E rate. Pearson's, Edgington's and Held's methods again behave very similarly, with moderately increased partial T1E rates compared to the three-trials rule. For example, in the worst-case scenario the partial T1E rate is between 10.6 and 11.0% rather than 6.8%.

Example
We have already considered the example with p 1 = 0.02, p 2 = 0.02 and p 3 = 0.001, which does not lead to success with the 2-of-3 rule.

Early stopping
Of particular interest is a sequential application of the different methods. The 2-of-3 rule has some advantages here, as it can stop already after two studies, see Figure 3a for a schematic illustration. Specifically, if the first two studies are significant at level $\alpha_{2|3}$, a third trial is no longer needed and resources can be saved. Likewise, if the first two studies are both not significant, a third trial is pointless because success can never be achieved. However, there is no mechanism to stop already after the first trial, even if it is fully unconvincing.
In contrast, all methods that control the partial T1E rate at a non-trivial bound can stop for failure already after the first and second trial, as illustrated in Figure 3b for Edgington's method. However, to achieve success these methods always need three trials.
This is because they are all based on a "budget" $a_3$, $b_3$ respectively $c_3$, and the results from the three trials need to be "within budget":

$$K_3 = -2\sum_{i=1}^{3} \log(1 - p_i) \le a_3 \text{ (Pearson)}, \quad E_3 = \sum_{i=1}^{3} p_i \le b_3 \text{ (Edgington)}, \quad H_3 = \sum_{i=1}^{3} 1/Z_i^2 \le c_3 \text{ (Held)}.$$

A very convincing trial will cost only very little and will reduce the budget only slightly.
On the other hand, an unconvincing first trial is likely to overspend the budget so success will be impossible, no matter what the results of the remaining two studies will be. We can hence stop the project for failure and there is no need to conduct the second and third study. Likewise, if the budget is overspent after the second trial, there is no need to conduct the third study.
What is different between the three methods is the "currency": the "price" of a study is given in $-2\log(1 - p_i)$ (Pearson), $p_i$ (Edgington) or $1/Z_i^2$ (Held). This gives different weights for different p-values, as shown in Figure 4, where the price of a single study is normalized to a "unit budget". The Figure shows that the price of Pearson's method is close to Edgington's method, so nearly linear. In contrast, Held's method has higher prices of convincing studies with small p-values, while less convincing studies are "cheaper". The difference between Held's and the other methods is larger for $n = 2$ than for $n = 3$ studies.
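The "price" comparison can be made concrete with a short sketch that normalizes the price of one study by the total budget of each method. The prices $-2\log(1-p)$, $p$ and $1/z^2$ and the budget definitions are assumptions based on the standard forms of the three methods (for Held's method, a critical value calibrated at $2^n\alpha_\cap$); the function names are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()
ALPHA_CAP = 0.025 ** 2


def chi2_cdf_even(x: float, n: int) -> float:
    """CDF of chi-squared with 2n degrees of freedom (closed form)."""
    y = x / 2.0
    return 1.0 - math.exp(-y) * sum(y ** k / math.factorial(k) for k in range(n))


def pearson_budget(alpha_cap: float, n: int) -> float:
    """alpha_cap-quantile of chi-squared with 2n df, by bisection."""
    lo, hi = 0.0, 50.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if chi2_cdf_even(mid, n) < alpha_cap else (lo, mid)
    return (lo + hi) / 2.0


def edgington_budget(alpha_cap: float, n: int) -> float:
    """Critical value for the sum of p-values (assumes it is at most 1)."""
    return (math.factorial(n) * alpha_cap) ** (1.0 / n)


def held_budget(alpha_cap: float, n: int) -> float:
    """n^2 / d_n, assuming Pr(chi2_1 >= d_n) = 2^n * alpha_cap."""
    return n ** 2 / STD_NORMAL.inv_cdf(1.0 - 2 ** (n - 1) * alpha_cap) ** 2


def prices(p: float, n: int, alpha_cap: float = ALPHA_CAP) -> dict:
    """Price of one study with one-sided p-value p (0 < p < 0.5),
    as a fraction of the total budget of each method."""
    z = STD_NORMAL.inv_cdf(1.0 - p)
    return {
        "Pearson": -2.0 * math.log(1.0 - p) / pearson_budget(alpha_cap, n),
        "Edgington": p / edgington_budget(alpha_cap, n),
        "Held": (1.0 / z ** 2) / held_budget(alpha_cap, n),
    }


convincing = prices(0.001, n=2)
weak = prices(0.03, n=2)
print(convincing)
print(weak)
```

A price above 1 means that the study alone overspends the budget, so the project can be stopped for failure.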
Possible decisions after two trials are shown in the left panel of Figure 5 for the different methods considered and compared to the three-trials rule. Remarkably, the area where the three-trials rule continues to the third study is considerably smaller than for the other three methods. Held's method will always continue with a third study if the three-trials rule does so, but may also continue if the three-trials rule doesn't.

Sequential application
Sequential application of one of the methods also allows stopping for success after 2 studies. Adjusted significance levels $\alpha_2$ and $\alpha_3$ for the tests after two respectively three studies then have to be chosen such that the overall T1E rate is equal to $\alpha_\cap = \alpha^2$. The level $\alpha_2 = q\, \alpha_\cap$ will be a proportion $q$ of $\alpha_\cap$, following the theory of group-sequential methods [19, Section 8.2]. Of course, the more lenient we are at the first assessment (after two studies), the stricter we have to be at the second (after three studies). I will choose $q = 0.72$ throughout to allow for a 20% increase of the partial T1E rate to 0.03 for Edgington's method and $n = 2$ trials, but of course other choices can be made. Sequential application with $q = 0.72$ is illustrated in Figure 3c. As in the non-sequential version, stopping for failure after Trial 1 or Trial 2 is possible, but now also stopping for success after two trials.
Computation of the adjusted level $\alpha_3$ to be used after results from all three trials are available can be done as in group-sequential trials, based on the decompositions

$$K_3 = K_2 - 2\log(1 - p_3), \quad E_3 = E_2 + p_3, \quad H_3 = H_2 + 1/Z_3^2$$

for Pearson's, Edgington's and Held's method, respectively, where $K_n = -2\sum_{i=1}^{n}\log(1 - p_i)$, $E_n = \sum_{i=1}^{n} p_i$ and $H_n = \sum_{i=1}^{n} 1/Z_i^2$. The two terms on the right-hand side of each equation are independent with known distributions under the intersection null. A convolution can therefore be used to compute the distribution of $K_3$, $E_3$ respectively $H_3$ conditional on no success after two studies. This can then be used to compute the adjusted level $\alpha_3$ shown in Table 6; details are given in Appendix A. Table 6 also gives the corresponding partial T1E bounds $\gamma_2$ respectively $\gamma_3$ for the sequential methods, which are smaller than for the non-sequential methods.

Table 6: Adjusted significance levels $\alpha_i$ and partial T1E bounds $\gamma_i$, $i = 2, 3$, for the two- and three-trials rule, Pearson's, Edgington's and Held's method, including sequential versions of the latter three.
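For Edgington's method the sequential calibration can be sketched without numerical convolution, because the relevant null probabilities are volumes of simple regions in the unit cube. The sketch below is a shortcut under stated assumptions and not the convolution approach of Appendix A: it assumes success after two trials if $p_1 + p_2 \le b_2$, success after three trials if $p_1 + p_2 > b_2$ and $p_1 + p_2 + p_3 \le b_3$, and $b_2 \le b_3 \le 1$. The function names are hypothetical and the resulting values are not taken from Table 6.

```python
import math


def edgington_overall_t1e(b2: float, b3: float) -> float:
    """Pr(E_2 <= b2) + Pr(E_2 > b2, E_3 <= b3) under the intersection null,
    valid for 0 <= b2 <= b3 <= 1 (p-values independent U(0, 1))."""
    success_after_two = b2 ** 2 / 2.0
    # Pr(E_3 <= b3) minus Pr(E_2 <= b2, E_3 <= b3)
    success_after_three = b3 ** 3 / 6.0 - (b3 * b2 ** 2 / 2.0 - b2 ** 3 / 3.0)
    return success_after_two + success_after_three


def sequential_edgington_budgets(alpha_cap: float, q: float):
    """Budgets (b2, b3): spend the proportion q of alpha_cap after two trials
    and solve for b3 by bisection so that the overall T1E rate is alpha_cap."""
    b2 = math.sqrt(2.0 * q * alpha_cap)
    lo, hi = b2, 1.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if edgington_overall_t1e(b2, mid) < alpha_cap:
            lo = mid
        else:
            hi = mid
    return b2, (lo + hi) / 2.0


alpha_cap = 0.025 ** 2
b2, b3 = sequential_edgington_budgets(alpha_cap, q=0.72)
alpha3 = b3 ** 3 / 6.0   # nominal level of the test E_3 <= b3
print(b2, b3, alpha3)
```

With $q = 0.72$ the budget after two trials is $b_2 = 0.03$, in line with the 20% increase of the partial T1E rate mentioned in the text, and the budget $b_3$ is smaller than the non-sequential critical value for three trials.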

Possible decisions after the first trial
It is interesting that the sequential versions give three (rather than just two) options for how to proceed after the first study has been conducted. This is illustrated in Figure 6 for Edgington's and Held's method and $q = 0.72$; Pearson's method will be very similar to Edgington's method. As we can see, the three options are stopping for failure if $p_1 > \gamma_3$, continuing with a second trial (if $p_1 \le \gamma_2$) or continuing directly with a second and third trial, perhaps even in parallel. The last category is chosen if the p-value of the first study is only "suggestive" in the sense that a second trial will never lead to success but a second and a third trial may do. Formally this is achieved if $\gamma_2 < p_1 \le \gamma_3$.
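The three options after the first trial amount to a simple lookup against the two partial T1E bounds. In the sketch below the function name is hypothetical, $\gamma_2 = 0.03$ is the value stated for sequential Edgington with $q = 0.72$, and $\gamma_3 = 0.109$ is only an illustrative placeholder, not a value taken from Table 6.

```python
def decision_after_first_trial(p1: float, gamma2: float, gamma3: float) -> str:
    """Classify the first-trial p-value against the partial T1E bounds
    gamma2 (two-trial assessment) and gamma3 (three-trial assessment)."""
    if not 0.0 <= gamma2 <= gamma3 <= 1.0:
        raise ValueError("expected 0 <= gamma2 <= gamma3 <= 1")
    if p1 > gamma3:
        return "stop for failure"
    if p1 <= gamma2:
        return "continue with second trial"
    return "continue with second and third trial"


GAMMA2 = 0.03    # stated in the text for sequential Edgington, q = 0.72
GAMMA3 = 0.109   # illustrative placeholder only

for p1 in (0.01, 0.05, 0.20):
    print(p1, decision_after_first_trial(p1, GAMMA2, GAMMA3))
```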

Possible decisions after the second trial
The possible decisions of sequential methods (with $q = 0.72$) after the second trial are displayed in the right panel of Figure 5. Now there is the additional possibility that the procedures can stop for success. As a result, the failure regions are somewhat larger than in the non-sequential versions. The 2-of-3 rule is also displayed and has a very different success, continuation and failure region. The right panel of Figure 7 gives the expected number of trials required before the project is stopped. If the power is low, the 2-of-3 rule has the largest expected number of studies required. All other methods can stop for failure already after trial 1, so the expected number of trials is between 1 and 2 for underpowered studies. If the power is large, the non-sequential methods require the largest number of studies on average, because they require three studies to be conducted to reach success. The sequential versions of Pearson's, Edgington's and Held's method require a smaller number of studies, never larger than around 2.2 studies on average.

Discussion and Outlook
Alternatives to the two-trials rule require appropriate T1E control, both overall and partial. I have described and compared different p-value combination methods that have the same overall T1E rate, offer partial T1E control at a non-trivial bound and can be extended to a sequential assessment of success. The methods have different properties and a particular one can be chosen based on a desired bound on the partial T1E rate. Edgington's method is eventually recommended as it is easy to implement and communicate, and has only moderate partial T1E rate inflation but still substantially increased project power compared to the two- respectively three-trials rule. Of course, to guarantee exact overall T1E rate control, the specific type of combination method, whether fixed or sequential, as well as the maximum number of studies conducted have to be defined in advance of the project. This may be considered a limitation from a study conduct point of view [5].
The sequential versions of the different combination methods have the advantage that they can stop for success already after two trials. For $q = 0.72$, the probability that a third study has to be conducted is bounded by 30% (Pearson), 30% (Edgington) respectively 34% (Held), assuming that all three trials have the same power to detect the true effect, see Figure 8 in Appendix B. It remains to be investigated which choice of $q$ gives optimal operating characteristics, for example in terms of project power or the expected number of trials.
In the simulations presented I have assumed that the sample size of each trial was calculated in advance based on a pre-specified clinically relevant difference. However, the different methods also allow computing the sample size adaptively based on the results from the previous trial [23]. For example, we could design a second trial to detect the observed effect from the first trial with a certain power. This will be particularly simple for Edgington's method, where the required significance level is $b_2 - p_1$, but also straightforward for the other approaches. A comparison of the expected number of patients needed under the different procedures could be done and the conditional T1E rate could be investigated [3, Section 3.4]. We may also compute the conditional (or predictive) power to reach success given the results from an initial trial. The bound on the T1E rate can directly be used to describe when this power is exactly zero and when not. But even if conditional power is non-zero it can be too small to warrant a further trial. Conditional power can be computed directly for $n = 2$ from the results described in Micheloud et al.
It is well known from group-sequential methods that the p-value obtained after stopping for success overstates the evidence against the null and some adjustment is required [24]. This will be considered in future work. It is also possible to adapt the sequential methods such that a stop for success already after the first study would be possible. This would incorporate the argument brought forward by Fisher [25] that one large and very convincing trial may give sufficient evidence of efficacy, but would have to be very convincing.
Also Kay [26] notes that "where there are practical reasons why two trials cannot be easily undertaken or where there is a major unfulfilled public health need, it may be possible for a claim to be based on a single pivotal trial." The notion of replicability could then be enforced in a post-marketing requirement after conditional or accelerated drug approval [23].
P-value combination methods can be inverted to obtain a confidence interval for the effect estimate based on the corresponding p-value function [27]. In future work we will compare the different methods, for example in terms of coverage and width of the resulting confidence intervals. We will also investigate how this needs to be adapted to the sequential setting, where the standard combined effect estimate in group-sequential trials is known to be biased if we stop early for success [28].

A.1. Pearson's method
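Let $a_k$ denote the available budget based on the adjusted significance level $\alpha_k$, $k = 2, 3$.
To compute the T1E rate of the sequential Pearson method we need the distribution of the random variable $K_3 \mid \{K_2 > a_2\}$. Under the intersection null, the density of $K_2 \mid \{K_2 > a_2\}$ is

$$f(k_2 \mid \{K_2 > a_2\}) = \frac{\mathrm{Ga}(k_2; 2, 1/2)\, 1_{\{k_2 > a_2\}}}{1 - \alpha_2},$$

where $\mathrm{Ga}(x; a, b)$ denotes the density of the gamma distribution at $x$ with shape parameter $a$ and rate parameter $b$. Since $-2\log(1 - p_3)$ follows a $\mathrm{Ga}(1, 1/2)$ distribution and is independent of $K_2$, the convolution has density

$$f(k_3 \mid \{K_2 > a_2\}) = \begin{cases} \int_{a_2}^{k_3} f(k_2 \mid \{K_2 > a_2\})\, \mathrm{Ga}(k_3 - k_2; 1, 1/2)\, dk_2 & \text{for } k_3 > a_2, \\ 0 & \text{else.} \end{cases}$$

Numerical integration of $f(k_3 \mid \{K_2 > a_2\})$ can be used to compute the cumulative distribution function $F(x \mid \{K_2 > a_2\})$. Root-finding methods are then used to determine the value $a_3$ where

$$\alpha_2 + (1 - \alpha_2)\, F(a_3 \mid \{K_2 > a_2\}) = \alpha_\cap.$$

The corresponding value of the nominal level $\alpha_3$ can finally be obtained with the quantile function of the gamma distribution.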

A.2. Edgington's method
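Let $b_k$ denote the available budget based on the adjusted significance level $\alpha_k$, $k = 2, 3$.
To compute the T1E rate of the sequential Edgington method, we need the distribution of $E_3 \mid \{E_2 > b_2\}$. Under the intersection null, $E_2$ follows the Irwin-Hall distribution $\mathrm{IH}(2)$ with density

$$f(x) = \begin{cases} x & \text{for } 0 \le x < 1, \\ 2 - x & \text{for } 1 \le x \le 2, \\ 0 & \text{else.} \end{cases}$$

The convolution of $E_2 \mid \{E_2 > b_2\}$ and $p_3$ therefore has density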

Figure 1: Illustration of partial and overall T1E control with Edgington's method. The grey square region represents the success region of the two-trials rule in terms of the two p-values $p_1, p_2 \le \alpha = 0.025$, so the partial T1E rate is controlled at $\alpha$ and the overall T1E rate is $\alpha^2$. The success region below the red line also controls the partial T1E rate at $\alpha$, but has reduced overall T1E rate of $\alpha^2/2$. The success region below the green line controls the partial T1E rate at $\sqrt{2}\,\alpha \approx 0.035$ and has overall T1E rate $\alpha^2$, the same as for the two-trials rule.

Figure 2 compares the success regions of the two-trials rule with Pearson's, Edgington's and Held's method for $n = 2$ studies, and the sceptical p-value with $c = 0.04$ and $c = 0.43$. The success region of the two-trials rule requires both p-values to be smaller than $\alpha = 0.025$ and is represented by the grey square area. Pearson's method gives a success region bounded by a nearly straight line due to the approximation $\log(1 - p_i) \approx -p_i$, which leads to the approximate Pearson success criterion $p_1 + p_2 \lesssim 0.035$. The success region of Edgington's method is bounded by an exact straight line and nearly identical. The success regions of the sceptical p-value (with $c = 0.43$) and Held's method are bounded by a convex line with larger bounds on the partial T1E rate. The success region of the sceptical p-value with $c = 0.04$ has more overlap with the one from the two-trials rule.

Figure 2: Success regions of different p-value combination methods depending on the two p-values $p_1$ and $p_2$. The bound on the partial T1E rate is denoted as $\gamma$. The two-trials rule success region is the grey square area below the black line where $c = 0$ and $\gamma = \alpha$. All methods control the overall T1E rate at $\alpha^2 = 0.025^2 = 0.000625$, the area under each curve. The inset plot shows the dependence of the parameter $c$ of the sceptical p-value on the bound $\gamma$ of the partial T1E rate.

Figure 4: The price of a single study for Pearson's, Edgington's and Held's method as a function of the one-sided p-value. The price is normalized to a unit total budget (the dashed horizontal line) for $n = 2$ (left) and $n = 3$ (right) trials at overall T1E level $\alpha_\cap = 0.025^2$.

Figure 5: Possible decisions with Pearson's, Edgington's and Held's method after two studies with one-sided p-values $p_1$ and $p_2$ have been conducted. The non-sequential procedures (left) will stop for failure if the point $(p_1, p_2)$ is above the respective solid line; otherwise a third trial will be conducted. The grey areas indicate the corresponding continuation region of the three-trials rule. The sequential procedures (right) are indicated with two lines in the same color and will stop for failure if the point $(p_1, p_2)$ is above the upper line. They will stop for success if the point is below the lower line. If the point is between the upper and lower line, an additional third trial will be conducted. The grey areas indicate the corresponding success and continuation regions of the 2-of-3 rule.

Figure 6: Possible decisions based on the p-value $p_1$ from the first trial with the sequential versions of Edgington's and Held's method, respectively. Both methods control the overall T1E rate at $\alpha_\cap = 0.025^2$ and spend the proportion $q = 0.72$ of $\alpha_\cap$ on $\alpha_2$.

Figure 7 compares operating characteristics of the different methods for data from 3 trials, based on a simulation with $10^7$ samples. Each individual trial has been powered at the standard significance level 0.025, with power varying between 2.5% (where $\mu_i = 0$) and 97.5%. The left panel gives the difference in success rate compared to the three-trials rule. It shows that most of the methods have increased project power. The largest gains are obtained for the non-sequential versions, where Held's and Edgington's methods are slightly better than Pearson's. The sequential versions show smaller improvements compared to the three-trials rule, now with a more pronounced advantage of Held's method due to its substantially larger bound on the partial T1E rate for 2 trials. The 2-of-3 rule has less project power than the three-trials rule if the individual trials are underpowered, but more power if the trials are reasonably powered.
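A simulation of this kind can be sketched in a few lines of Python. The sketch is not the code behind Figure 7: it covers only the non-sequential Edgington rule, assumes normally distributed test statistics with unit variance, and uses far fewer samples than the $10^7$ reported above. The names `trial_mean` and `simulate_edgington_power` are hypothetical.

```python
import random
from math import factorial
from statistics import NormalDist

STD_NORMAL = NormalDist()
ALPHA = 0.025                # one-sided level used to power each trial
ALPHA_OVERALL = ALPHA ** 2   # overall T1E level of the two-trials rule


def trial_mean(power, alpha=ALPHA):
    """Mean of a unit-variance z-statistic such that a one-sided
    level-alpha test has the requested power (assumed normal model)."""
    return STD_NORMAL.inv_cdf(1 - alpha) + STD_NORMAL.inv_cdf(power)


def simulate_edgington_power(power, n_trials=3, n_sim=200_000, seed=1):
    """Monte Carlo estimate of the project power of the non-sequential
    Edgington rule: success if the sum of p-values is within the budget."""
    rng = random.Random(seed)
    mu = trial_mean(power)
    # Budget b_n solving b_n**n / n! = ALPHA_OVERALL (valid for b_n <= 1).
    budget = (factorial(n_trials) * ALPHA_OVERALL) ** (1 / n_trials)
    successes = 0
    for _ in range(n_sim):
        # One-sided p-values of n_trials independent trials.
        p_sum = sum(1 - STD_NORMAL.cdf(rng.gauss(mu, 1.0))
                    for _ in range(n_trials))
        successes += p_sum <= budget
    return successes / n_sim


if __name__ == "__main__":
    for power in (0.025, 0.5, 0.8, 0.9):
        print(power, simulate_edgington_power(power))
```

At trial power 2.5% the mean is zero, so the estimated success rate should be close to the overall T1E level $0.025^2$, which gives a simple sanity check of the simulation.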

Figure 7: Success rate (left, change in % compared to the three-trials rule) and number of trials needed (right) for the different methods as a function of trial-specific power.

Figure 8 shows, for the different approaches, the probability of stopping after the first, second or third trial as a function of the power of the individual trials, varying between 2.5% and 97.5%. The three-trials and 2-of-3 rules offer no possibility to stop after the first trial, so only two lines are shown.
as summarized in Box 1: Declare success if the sum of the (one-sided) p-values from the two studies is smaller than $\sqrt{2}\,\alpha \approx 0.035$. A single p-value can thus be larger than 0.025 and still lead to success (but not larger than 0.035), as long as the other one is sufficiently small. If three studies are considered, then the sum of the three p-values needs to be smaller than $\sqrt[3]{6\,\alpha^2} \approx 0.16$. Sequential versions are also developed,

Box 1: Edgington's method
Input: One-sided p-values $p_1$ and $p_2$ from two trials, possibly $p_3$ from a third trial
• 2 trials in parallel: Flag success if $p_1 + p_2 \le 0.035$
• 3 trials in parallel: Flag success if $p_1 + p_2 + p_3 \le 0.16$
• 3 sequential trials:
  - Flag success if after two trials $p_1 + p_2 \le b_2 = 0.03$
  - Otherwise flag success if after three trials $p_1 + p_2 + p_3 \le b_3 = 0.11$
  (Other choices can be made for $b_2$ and $b_3$)

Now small values of $K_n$ provide evidence against the intersection null and we need to threshold $K_n$ at the $\alpha_\cap$-quantile $a_n = \chi^2_{2n}(\alpha_\cap)$. This gives the success criterion $K_n \le a_n$, or equivalently

= 0.16 for $n = 3$. These bounds can be thought of as a maximum "budget" to be spent on the individual p-values to achieve success. If $n = 2$, for example, $p_1 + p_2 \le 0.035$ is required to flag success at the $0.025^2$ level; for $n = 3$ the success condition is $p_1 + p_2 + p_3 \le 0.16$. The simplicity of this rule is very attractive in practice, even if more flexibility in the choice of the partial T1E rate bound may be warranted.
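The Edgington rules of Box 1 can be sketched in Python as follows. The function names are hypothetical, and the closed-form expression in `sequential_overall_t1e` is a derivation under the assumption of independent uniform p-values with $b_2 \le b_3 \le 1$, not a formula quoted from the text.

```python
from math import factorial

ALPHA = 0.025
ALPHA_OVERALL = ALPHA ** 2  # overall T1E level of the two-trials rule


def edgington_budget(n, level=ALPHA_OVERALL):
    """Bound b_n on the sum of n p-values, solving b_n**n / n! = level.
    Valid while the resulting bound is at most 1."""
    return (factorial(n) * level) ** (1 / n)


def parallel_success(pvals, level=ALPHA_OVERALL):
    """Non-sequential rule: success if the sum of p-values is within budget."""
    return sum(pvals) <= edgington_budget(len(pvals), level)


def sequential_success(p1, p2, p3=None, b2=0.03, b3=0.11):
    """Sequential rule of Box 1 with the rounded bounds b2 and b3."""
    if p1 + p2 <= b2:
        return True   # success already after two trials
    return p3 is not None and p1 + p2 + p3 <= b3


def sequential_overall_t1e(b2, b3):
    """Overall T1E rate Pr(S2 <= b2) + Pr(S2 > b2, S3 <= b3), where S2 and
    S3 are sums of independent uniform p-values (assumes b2 <= b3 <= 1)."""
    stage_two = b2 ** 2 / 2
    both = b3 * b2 ** 2 / 2 - b2 ** 3 / 3   # Pr(S2 <= b2, S3 <= b3)
    stage_three = b3 ** 3 / 6 - both
    return stage_two + stage_three


if __name__ == "__main__":
    print(edgington_budget(2), edgington_budget(3))
    print(sequential_overall_t1e(0.03, 0.11), ALPHA_OVERALL)
```

The unrounded budgets are about 0.0354 and 0.155, so p-values summing to exactly the rounded bounds of Box 1 may fall marginally outside them. With the rounded bounds $b_2 = 0.03$ and $b_3 = 0.11$, the computed overall T1E rate is about 1% above $0.025^2$, which is attributable to rounding $b_3$ to two decimals; the share spent after two trials is $b_2^2 / (2\,\alpha_\cap) = 0.72$.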

Table 1: Project power of different methods for drug approval depending on individual trial power (all entries in %). Columns: Trial 2, Two-trials rule, $p_S$ ($c = 0.04$), Pearson, Edgington, $p_S$ ($c = 0.43$), Held.

Table 2: Partial T1E rate of different methods for drug approval depending on individual trial power (all entries in %). NULL indicates a trial with zero effect size.

Table 4: Partial T1E rate of different methods for drug approval depending on individual trial power (all entries in %). NULL indicates a trial with true effect size of zero.
p-values with the different methods discussed in Section 2. All would flag success at the $\alpha_\cap = 0.025^2$ level, with the three-trials rule having the smallest p-value. Table 5 also shows another example with $p_1 = 0.01$, $p_2 = 0.01$ and $p_3 = 0.2$, where the 2-of-3 rule would flag success, but all combination methods would not. Now the three-trials rule has the largest combined p-value. These examples illustrate that the 2-of-3 rule behaves very differently from the other four methods.
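The second example can be reproduced for two of the combination methods with a short Python sketch. It assumes the Pearson statistic $K_n = -2 \sum_i \log(1 - p_i)$ with small values counting as evidence, Edgington's combined p-value $(\sum_i p_i)^n / n!$ for sums not exceeding 1, and a 2-of-3 rule that requires at least two p-values below 0.025. The last definition and all function names are assumptions made for illustration.

```python
from math import exp, factorial, log

ALPHA = 0.025
ALPHA_OVERALL = ALPHA ** 2


def edgington_p(pvals):
    """Edgington's combined p-value; this simple form requires sum <= 1."""
    total = sum(pvals)
    if total > 1:
        raise ValueError("closed form used here requires sum(pvals) <= 1")
    return total ** len(pvals) / factorial(len(pvals))


def pearson_p(pvals):
    """Pearson's combined p-value Pr(chi2_{2n} <= K_n), using the closed-form
    chi-square CDF for even degrees of freedom."""
    n = len(pvals)
    half_k = -sum(log(1 - p) for p in pvals)   # K_n / 2
    return 1 - exp(-half_k) * sum(half_k ** j / factorial(j) for j in range(n))


def two_of_three_success(pvals, alpha=ALPHA):
    """Assumed 2-of-3 rule: at least two of three trials significant at alpha."""
    return sum(p <= alpha for p in pvals) >= 2


if __name__ == "__main__":
    example = (0.01, 0.01, 0.2)
    print("Edgington:", edgington_p(example))
    print("Pearson:  ", pearson_p(example))
    print("2-of-3 success:", two_of_three_success(example))
    print("threshold:", ALPHA_OVERALL)
```

Under these assumptions both combined p-values exceed $0.025^2$ for this example, while the 2-of-3 rule flags success, in line with the description above.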

Table 5: Combined p-value with different methods for two examples with p-values from three trials.

[19, Section 8.2.2] based on the factor- (15) condition is needed, because the assessment of success after 3 studies requires that there was no success after 2 studies, i.e. $K_2 > a_2$. Due to (15), the density $f(k_3 \mid \{K_2 > a_2\})$ can be obtained by a convolution of the density of $K_2 \mid \{K_2 > a_2\}$ and the density of $-2 \log(1 - p_3) \sim \mathrm{Ga}(1, 1/2)$, a gamma distribution with shape and rate parameters 1 and 1/2. The density of $K_2 \mid \{K_2 > a_2\}$ is a $\mathrm{Ga}(2, 1/2)$ density truncated to $K_2 > a_2$, which occurs with probability 1