Combining probability from independent tests: the weighted Z-method is superior to Fisher's approach

Authors


Michael C. Whitlock, Department of Zoology, University of British Columbia, Vancouver, BC, Canada V6T 1Z4.
e-mail: whitlock@zoology.ubc.ca

Abstract

The most commonly used method in evolutionary biology for combining information across multiple tests of the same null hypothesis is Fisher's combined probability test. This note shows that an alternative method called the weighted Z-test has more power and more precision than does Fisher's test. Furthermore, in contrast to some statements in the literature, the weighted Z-method is superior to the unweighted Z-transform approach. The results in this note show that, when combining P-values from multiple tests of the same hypothesis, the weighted Z-method should be preferred.

In evolutionary biology, as in most branches of empirical science that have embraced a statistical approach, it is relatively common to have several independent tests of the same null hypothesis. Often we would like to combine the results of these tests to ask whether there is evidence from the collection of studies that we might reject the null hypothesis. The collection of methods known as meta-analysis gives many ways to do these combinations, including some techniques that combine P-values from multiple independent tests.

In this note, I will briefly review some methods that are used to address this type of question. The most commonly used method in evolutionary biology is Fisher's combined probability test, but there are other methods available. Two of the most effective alternatives are analysed here. The main part of this note describes a simulation study of Type-I error rates, power and precision of the results from these various combined probability tests. The results show that a procedure called the weighted Z-method is superior in many ways to Fisher's combined probability test, and it should therefore be preferred. I start by reviewing the common assumptions of all of these methods, followed by a review of Fisher's method and some of its alternatives.

Assumptions, definitions and transformations

The procedures that we are interested in combine P-values from several independent studies to test whether collectively they can reject a common null hypothesis. Assume that there are k of these studies to be combined. The ith study has νi d.f., and we can write the one-tailed P-value of each study as Pi. One stumbling block to the use of these methods is that papers often do not give precise P-values, but these must be obtained in some way in order to use the combined tests.

Notice that the paragraph above talked about one-tailed P-values. Most P-values in our literature are in fact two tailed, which leads to a subtle issue. Let us imagine that we are comparing two groups, say inbred and outbred individuals for some trait like body size. If we want to know whether inbreeding has an effect on body size, we would want to distinguish between results that showed inbred individuals with larger body size and results that showed that outbred individuals were larger. Yet either result could lead to the rejection of the null hypothesis of no effect with a two-tailed test. If a few studies that show significant reduction in body size are balanced by a few more that show a significant increase in size with inbreeding, then we would want to conclude that on average inbreeding did not affect body size. Thus we would want to keep track of the direction of the effect when we combine the information from the different studies. The solution is to calculate the P-value from each test as a one-sided P-value, with P-values close to zero having a consistent meaning (say, inbred smaller than outbred) and P-values close to one having the opposite meaning (outbred smaller than inbred). After combining the P-values, if desired the resulting combined P can be again converted to a two-tailed test by multiplying it by two. The rest of the paper assumes that P values have been written on a consistent one-tailed scale.
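As a concrete illustration of this conversion (not part of the original analysis; the function name and arguments here are my own), a two-tailed P-value can be put on a consistent one-tailed scale as follows, given the direction of each study's effect:

```python
def to_one_tailed(p_two_tailed, effect_in_reference_direction):
    """Convert a two-tailed P-value to a one-tailed P on a consistent scale.

    effect_in_reference_direction is True when the study's effect points in
    the reference direction (e.g. inbred smaller than outbred); such studies
    get one-tailed P-values near 0 when significant, while studies pointing
    the other way get P-values near 1.
    """
    if effect_in_reference_direction:
        return p_two_tailed / 2.0
    return 1.0 - p_two_tailed / 2.0


# A study with two-tailed P = 0.04 whose effect is opposite to the reference
# direction maps to a one-tailed P of 0.98.
print(to_one_tailed(0.04, effect_in_reference_direction=False))  # 0.98
```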

Fisher's combined probability test

Fisher's combined probability test (Fisher, 1932) uses the P-values from k independent tests to calculate a test statistic $\chi^2 = -2\sum_{i=1}^{k} \ln(P_i)$. If all of the null hypotheses of the k tests are true, then this statistic will have a χ2 distribution with 2k degrees of freedom.
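A minimal sketch of this calculation (in Python with SciPy, purely for illustration; the original simulations were done in Mathematica):

```python
import numpy as np
from scipy import stats

def fishers_method(p_values):
    """Fisher's combined probability test on k one-tailed P-values."""
    p = np.asarray(p_values, dtype=float)
    chi2_stat = -2.0 * np.sum(np.log(p))                   # test statistic
    combined_p = stats.chi2.sf(chi2_stat, df=2 * len(p))   # chi-square with 2k d.f.
    return chi2_stat, combined_p
```

For reference, SciPy also provides scipy.stats.combine_pvalues, which implements Fisher's method as well as the unweighted and weighted Z (Stouffer) approaches discussed below.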

Fisher intended his procedure to give a ‘composite test,’ a ‘test of the significance of the aggregate’ of a number of independent tests (Fisher, 1932, p. 97). It has since been interpreted as a test of whether at least one of the studies being combined can reject its null hypothesis (see, e.g. van Zwet & Oosterhoff, 1967; Westberg, 1985; Rice, 1990). By this interpretation, Fisher's method has much in common with the corrections for multiple comparisons that have come into vogue in evolution (Westberg, 1985; and see Rice, 1989). But this is not the original interpretation. Fisher, with typical inscrutability, seems to have intended that the test ask whether the accumulation of information among tests of similar null hypotheses can reject that shared null hypothesis. This seems to be the usage of the test most often applied in evolutionary biology, and this is what we want to explore here.

Fisher's method, however, does have one significant drawback in this context. It treats large and small P-values asymmetrically. It is easiest to see this problem with an example (see Rice, 1990). Imagine there were two studies on a topic that we would like to combine. One of these studies rejected the null hypothesis with P = 0.001, while the other did not with P = 0.999. Clearly, on average there is no consistent effect in these two studies, yet by Fisher's method the P-value is P = 0.008. Fisher's method is asymmetrically sensitive to small P-values compared to large P-values. The undesirability of this result can be seen when we consider that these results would reject the null hypothesis in favour of contradictory one-tailed alternate hypotheses.

As a further example of this asymmetry, remember that high P-values (i.e. near one) would be evidence in favour of rejecting the null hypothesis in favour of the opposite alternate hypothesis. A result of P = 0.99 is as suggestive of a true effect, in the opposite direction, as is a result of P = 0.01. Yet with Fisher's test, if we get two studies with P = 0.01, the combined P is 0.001, but with two studies with P = 0.99 the combined P is 0.9998. One minus 0.9998 is 0.0002. The high-P and low-P results differ by a factor of five, yet the answer should be the same in both cases.
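These numbers can be verified directly with a short, self-contained calculation (illustrative Python; the helper function here is my own):

```python
import numpy as np
from scipy import stats

def fisher_p(p_values):
    """Combined P from Fisher's method on one-tailed P-values."""
    stat = -2.0 * np.sum(np.log(p_values))
    return stats.chi2.sf(stat, df=2 * len(p_values))

print(fisher_p([0.001, 0.999]))    # ~0.008, despite the two results cancelling out
print(fisher_p([0.01, 0.01]))      # ~0.001
print(1 - fisher_p([0.99, 0.99]))  # ~0.0002, not the mirror image of the line above
```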

This asymmetry results in a bias for results combined from multiple studies on the same null hypothesis. The bias is not normally as great as in these examples, but any bias is undesirable. Fortunately, there are other methods available.

The Z-transform test

I will consider in detail two alternative methods for combining P-values here: the Z-transform test and the weighted Z-test. These two are closely related to each other; the Z-transform test is the same as a weighted Z-test with equal weight given to all studies.

The Z-transform test takes advantage of the one-to-one mapping of the standard normal curve to the P-value of a one-tailed test. By Z we mean a standard normal deviate, that is, a number drawn from a normal distribution with mean 0 and standard deviation 1. As Z goes from negative infinity to infinity, P will go from 0 to 1, and any value of P will uniquely be matched with a value of Z and vice versa. The Z-transform test converts the one-tailed P-values, Pi, from each of k independent tests into standard normal deviates Zi. The sum of these Zi’s, divided by the square root of the number of tests, k, has a standard normal distribution if the common null hypothesis is true. Thus,

$$Z_S = \frac{\sum_{i=1}^{k} Z_i}{\sqrt{k}}$$

can be compared to the standard normal distribution to provide a test of the cumulative evidence on the common null hypothesis. Clearly this test does not suffer from the asymmetry problems discussed above for Fisher's method.

As this test is sometimes called ‘Stouffer's method,’ here I will write its test statistic as ZS. (The Z-transform test was originally performed by Stouffer et al., 1949 in a footnote on p. 45 of their sociological work on the Army, The American Soldier. This must be one of the most obscure origins of a statistical method in the literature.)
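A sketch of the unweighted Z-transform test (illustrative Python, following the convention above that P near 0 maps to a large negative Z, so the combined P is returned on the same one-tailed scale):

```python
import numpy as np
from scipy import stats

def stouffer_z(p_values):
    """Unweighted Z-transform (Stouffer's) test on k one-tailed P-values."""
    p = np.asarray(p_values, dtype=float)
    z = stats.norm.ppf(p)              # P near 0 -> large negative Z; P near 1 -> large positive Z
    z_s = np.sum(z) / np.sqrt(len(p))  # standard normal under the shared null hypothesis
    return z_s, stats.norm.cdf(z_s)    # combined one-tailed P on the same scale
```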

Weighted Z-method

This Z-transform test was generalized (apparently independently of Stouffer et al.'s work and of each other) by Mosteller & Bush (1954) and Liptak (1958) to give different weights to each study according to their power. In the weighted Z-method, each test can be assigned a weight, wi:

$$Z_w = \frac{\sum_{i=1}^{k} w_i Z_i}{\sqrt{\sum_{i=1}^{k} w_i^2}}$$

When each test is given an equal weighting, this reduces to the Z-transform procedure.

How are these weights to be chosen? Ideally each study is weighted proportional to the inverse of its error variance, that is, by the reciprocal of its squared standard error. For studies that use t-tests, for example, this is done by weighting each study by its d.f.: wi = νi. More generally, the weights should be the inverse of the squared standard error of the effect size estimate for each study.
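A sketch of the weighted version (illustrative Python; the choice of weights, for example degrees of freedom or inverse squared standard errors as described above, is left to the user):

```python
import numpy as np
from scipy import stats

def weighted_z(p_values, weights):
    """Weighted Z-method (Liptak; Mosteller & Bush) on k one-tailed P-values."""
    p = np.asarray(p_values, dtype=float)
    w = np.asarray(weights, dtype=float)
    z = stats.norm.ppf(p)
    z_w = np.sum(w * z) / np.sqrt(np.sum(w ** 2))  # variance of the weighted sum is sum(w^2)
    return z_w, stats.norm.cdf(z_w)

# With equal weights this reduces to the unweighted Z-transform test.
```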

Whether to use the weighted or unweighted version of this test is actually an open question in meta-analysis. On the one hand, it may seem logical that we would want to weight studies with more information more strongly than those with less information. On the other hand, the argument has been made that P-values are already weighted by sample size. If the null hypothesis is globally true, then all P-values should be uniformly distributed between 0 and 1 regardless of sample size. When the null hypothesis is false, then for the same effect size the P-value will be smaller for a larger study than for a small study. Thus some (e.g. Becker, 1994) have argued that it is inappropriate to weight studies differentially when combining P-values. The relative merits of the two approaches apparently have never been tested, aside from this rather philosophical argument.

Methods: comparing error rates, power and accuracy of P

We test the properties of the combined probability tests by simulating a one-sample t-test on data that match the assumptions of the t-test. The t-test compares the mean of the sample to a constant μ0, given by the null hypothesis. Without loss of generality, the null hypothesis for all tests in this paper is μ0 = 0. Data are drawn from a common normal distribution with unit variance. The mean of this distribution, μ, was set to either zero (when we are calculating Type-I error rates, i.e. the null hypothesis μ = 0 is true) or greater than zero (when we calculate the Type-II error rates, that is, when the null hypothesis μ = 0 is false). Each complete data set was used with a standard one-sample t-test to calculate the ‘true’ P-value for the data. Each data set was then exhaustively sub-sampled without replacement into smaller data sets. A one-sample t-test was performed on each of these subsets, providing a list of P-values on exactly the same data as were analysed with the global t-test. A perfect combined probability test would generate the same P-value from the list of results from the subset tests as was obtained from the t-test on the whole data set. Deviations from this match indicate error in the combination of probabilities.
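A minimal sketch of one replicate of this design (illustrative Python, not the original Mathematica code; the subset sizes shown are the variable-sample-size configuration described below):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def one_tailed_p(sample):
    """One-tailed P-value of a one-sample t-test of H0: mu = 0 against mu > 0."""
    t_stat = stats.ttest_1samp(sample, popmean=0.0).statistic
    return stats.t.sf(t_stat, df=len(sample) - 1)

def one_replicate(mu, subset_sizes):
    """Draw one pooled sample, record its 'true' P-value, then split the same
    data without replacement into subsets and return their P-values."""
    data = rng.normal(loc=mu, scale=1.0, size=sum(subset_sizes))
    p_true = one_tailed_p(data)
    subsets = np.split(rng.permutation(data), np.cumsum(subset_sizes)[:-1])
    return p_true, [one_tailed_p(s) for s in subsets]

# One replicate with the variable-sample-size configuration described below
p_true, p_subsets = one_replicate(mu=0.05,
                                  subset_sizes=[10, 20, 40, 80, 160, 320, 640, 1280])
```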

This procedure represents the ideal of a combined probability test: that is, all of the studies in fact test exactly the same null hypothesis on the same population. In reality, this is never the case. Even when each study asks the same question, different studies are normally done on samples from different populations, or even different species. As a result, there is likely more real heterogeneity among real studies to be combined than is represented by the simulation model here. Thus our error rates are likely underestimates of the true situation. On the other hand, if the studies in question are on a diverse array of taxa, then the true level of replication is the species or taxonomically independent contrast, not the individuals within a study. In these cases, the weighted Z-approach is therefore inappropriate.

Simulations were performed using Mathematica v.5.0.0.0 (Wolfram, 2003). One detail is important to mention. I found that Mathematica's Random[] routine for the normal distribution generated a significant excess of samples with large means and a deficit of samples with extremely small means. The distribution of sample means was not normal, as it should have been (P < 0.001 by a Shapiro–Wilk test, with 10^6 means of 40 standard normal deviates each). In contrast, the distribution of individual draws from the random number generator for the normal distribution was not significantly different from a normal distribution, so the problem probably comes from autocorrelation among consecutive samples. A change to the Random[] routine fixes this problem (see http://forums.wolfram.com/mathgroup/archive/2000/May/msg00088.html). The revised uniform random number function given at that link was used to generate uniform random numbers, and it was also used in a re-write of the Random[NormalDistribution[]] function. This revision results in a normal random number generator whose output cannot reject the null hypothesis of normality. The problem with the uniform random number generator in Mathematica will affect many or most of its random number generator routines.
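The same kind of sanity check can be applied to any generator. Here is an illustrative sketch in Python with NumPy's current default generator; the 5000 replicates are chosen so that SciPy's Shapiro–Wilk P-value remains reliable and are not meant to match the 10^6 used above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Means of 40 standard normal deviates each; a well-behaved generator should
# give sample means that look normal.
sample_means = rng.standard_normal(size=(5000, 40)).mean(axis=1)
w_stat, p_value = stats.shapiro(sample_means)
print(p_value)  # should not be small for a well-behaved generator
```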

Two conditions were varied. First, the true mean of the distribution was varied, allowing measurement of the Type-I error rate (when the null hypothesis is true, μ = 0) and of the Type-II error rates under a variety of values of μ. Second, the asymmetry of sample size was varied, ranging from even division into equal samples to a range across studies of several orders of magnitude.

Most of the simulations covered two cases: one where the sample size varied and another where the sample size was constant. In either case the total sample size was held constant at 2550 (or 2552 in the constant-size group), and the number of studies was held constant at eight. In the constant sample size case, each study had n = 319 individuals in the sample, while in the variable sample size group the sample sizes were n = 10, 20, 40, 80, 160, 320, 640, 1280. The true mean of the population from which these samples were drawn varied from 0 to 0.1 standard deviations, which effectively covered the range from a true null hypothesis to nearly zero Type-II error rates. For each set of parameters, 10^4 total data sets were analysed (10^5 data sets were analysed to test Type-I error rates). For each total sample, the P-value based on the total sample was calculated, as well as the combined P-value from each of the methods, based on a single generation of subsets.

Results

The results are presented in terms of the Type-I and Type-II error rates, as well as by the correlation between the combined P-value and the true P-value ascertained by the t-test on the whole data set. In practice, we want a method to combine probabilities that has Type-I error rates equal to the stated α, low Type-II error rates, and a high correlation between the combined and true P-values. We will see that all of the tested methods have the right Type-I error rates, but the weighted Z-method tends to do best on the other criteria.

Type-I error rates for each test were calculated for α = 0.01 and 0.05. If the tests are unbiased, then we would expect to reject a true null hypothesis with probability α. All tests gave Type-I error rates consistent with the expected significance level, as tested by a binomial test. In an additional set of tests, the constant sample size case was tested over a range of sample sizes from 10 samples of size 4 each to 10 samples of size 256, with 10 000 replicates of each sample configuration. In all cases, Type-I error rate was consistent with the stated α value. Each of these tests seems to be unbiased when the null hypothesis is true.
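As an illustration of the binomial check described above (the counts here are hypothetical, not the simulation's actual numbers; scipy.stats.binomtest is available in recent SciPy versions):

```python
from scipy import stats

# Hypothetical: 100 000 replicates under a true null hypothesis, of which
# 5080 rejected at alpha = 0.05.
result = stats.binomtest(5080, n=100_000, p=0.05)
print(result.pvalue)  # not small, so the observed Type-I error rate is consistent with alpha
```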

Differences among the methods for combining probabilities emerge when we consider the power of the tests, however. Figure 1 shows the probability of rejecting the null hypothesis as a function of the size of the true effect, μ. As μ gets larger, every test has more power, as expected, but Fisher's test tends to have lower power than the other tests. For this reason, the weighted Z-method should be preferred over Fisher's combined probability test.

Figure 1.

Rejection rates for false null hypotheses. These curves give the probability of rejecting the null hypothesis H0: μ0 = 0, as a function of the true value of μ on the x-axis. Letters on the curves mark simulated points and show which method of combining probabilities was used. ‘T’ gives the rejection rate for the total data set, ‘F’ shows results for Fisher's method, ‘Z’ shows the unweighted Z-transform approach, and ‘W’ gives results for the weighted Z-method. For the constant sample size plots (on the left-hand side), results for the Z-transform method are not shown as they are identical to those of the weighted Z-method. In all cases, the weighted Z-method is more likely to reject a false null hypothesis than is Fisher's method and gives results similar to analysing the data as a whole.

As discussed in the introduction, there is an argument in the statistical literature about whether to use the weighted or unweighted versions of the Z-transform test. When all of the sample sizes are equal (Fig. 1, left side), the tests are identical. When there is variation in the sample size across studies, there can be a noticeable difference in the power of the two methods, with the weighted Z-approach being superior in all cases. As such, we should always prefer the weighted Z to the unweighted Z-approach when the independent studies test the same hypothesis.

A final means to compare these methods, though unorthodox, is useful, I think. In the simulations done here, all of the individual data points were in fact drawn from the same distribution, so they can legitimately be combined into a single sample, unlike most combined data sets. As a result, we are able to calculate the true P-value for each data set with the t-test on the pooled data, and we can then ask how well correlated the P-value given by each combination method is with this true P-value over replicates. Figure 2 shows scatter plots of a subset of the results. It can be readily seen that the Z-transform methods give P-values that are more highly correlated with the true P-values than does Fisher's approach. Interestingly, this is true even when the null hypothesis is true. The generality of this result is confirmed in Fig. 3.

Figure 2.

The results of combined tests compared with true P-values. The true P-value, as calculated from the pooled data set, is on the y-axis, compared with the results from either Fisher's combined probability test or the weighted Z-method. Fisher's method is less precise than the weighted Z-method. In these examples, the null hypothesis was true, μ = 0, and sample size was variable.

Figure 3.

The correlation of combined probability P-values with the true P-value. For either constant sample size (a) or variable sample size (b), the weighted Z-method gives P-values that are more precise reflections of the true P-value for those data. When sample size varies across studies, the weighted Z-approach gives more precise P-values than does the unweighted Z-transform approach. Letters denoting the simulation results are the same as in Fig. 1.

Discussion

Despite its claims to priority and its widespread use, Fisher's test is not the best test available to combine information from P-values of multiple tests of the same null hypothesis. From what we have seen here, the weighted Z-method originally due to Liptak (1958) and Mosteller & Bush (1954) gives results that are more precise and that have more power than Fisher's test. In our field, we have almost always used Fisher's test, and these results show that we should switch to the weighted Z-method.

Furthermore, despite untested claims to the contrary, it is better to use the weighted version of the Z-transform method rather than the unweighted method. The weighted method requires more information about each study, but this is information that is almost always available (usually, the sample sizes of each study are enough to calculate these weights). Weighting is inappropriate, though, when the studies are drawn from a range of species, because then the true level of replication is the number of species.

One caveat that is important to add to any discussion of combining P-values is that these methods can be subject to extreme publication bias. It is often the case that studies that found P-values greater than 0.05 are less likely to be published than studies that found P < 0.05 (Egger & Davey Smith, 1998; Palmer, 2000). In the extreme, if only studies with P-values less than 0.05 were published, then every null hypothesis would appear from the literature to be rejected. Combining P-values across studies increases the power of the collective analysis, but if there is publication bias the Type-I error rate can be substantially skewed from α. Fortunately, publication bias seems to be greater for smaller studies than for larger ones. A side benefit, then, of weighting larger studies more heavily is that it reduces the effects of publication bias on the assimilation of information across studies.

Many other alternative methods have been proposed to combine P-values from independent tests, and several have been reviewed by Rosenthal (1978), Rice (1990) and Becker (1994). These other methods are known to be less effective in general than those mentioned here. On the other hand, the burgeoning field of meta-analysis has many alternatives that depend on combining not the P-values but the effect sizes (see, e.g. Cooper & Hedges, 1994). These are more useful because they give more insight into the magnitudes of effects, and they are often more powerful because they use more information from the data. We should emphasize that combining P-values should normally be undertaken only when more complete information is not available.

In summary, Fisher's combined probability test is not the best test available for combining P-values from independent tests of the same null hypothesis. A better method is the weighted Z-approach, and this should be preferred in future applications.

Acknowledgments

I appreciate helpful comments and discussion with Dolph Schluter, Sally Otto, Nancy Heckman, and anonymous reviewers. This work was supported by a grant from the Natural Sciences and Engineering Research Council (Canada).
