Review of alternative approaches to calculation of a confidence interval for the odds ratio of a 2 × 2 contingency table
Corresponding author. E-mail: firstname.lastname@example.org
A common situation in biology is where we have count data and wish to explore whether there is an association between two categorical variables, each with two levels (a 2 × 2 contingency table). The size of the association can be measured using the odds ratio, with a confidence interval for this measure enclosing unity suggesting no evidence of an association. However, there is no universally agreed method for calculating such a confidence interval.
Here, we provide a review of some commonly used and recently suggested methods.
Of all of the methods currently available, the unconditional approach based on the score statistic was consistently closest to the nominal type I error level in our investigations, and this is the method we generally recommend. This method also offers good agreement with P-values from null hypothesis testing using the method of Fisher-Boschloo.
However, some scientists may prefer the recently developed minlike or Blaker methods, which offered better agreement with P-values calculated using Fisher's Exact test or Blaker's Exact test, respectively.
Lastly, where calculation without use of a computer is required, we recommend the Woolf method with Haldane-Anscombe correction.
A common situation in biology is where we have count data and wish to explore whether there is an association between two unordered categorical variables. An example of this might be exploring whether the sex (male or female) of wild mice caught in a trap is related to their infection status with respect to some disease (infected or not). We will particularly focus on the situation described in this example where each variable has only two levels. Such count data can be organised into a 2 × 2 contingency table, and it is conventional to test for association between the two variables using either a Chi-squared test, G-test or Fisher's Exact Test. This approach has recently been reviewed by both Lydersen, Fagerland & Laake (2009) and Ruxton & Neuhäuser (2010a).
An alternative approach to such null hypothesis statistical testing is calculation of the effect size: in biology, the most commonly used measure of association in such cases is the odds ratio (OR). For the example above, the odds of a female having the disease are simply the probability of a captured female having the disease divided by the probability of such a female not having the disease. Analogously, the odds of a male having the disease are simply the probability of a male having the disease divided by the probability of that male not having the disease. The OR is simply the quotient of these two odds (for males and females in our case). An OR of one is equivalent to no association (in our example, no difference between males and females in likelihood of having the disease). The sample OR is easily calculated for any 2 × 2 contingency table of counts:

OR = (N 11 × N 22)/(N 12 × N 21)

where N 11 and N 12 are the numbers of individuals in the first category of the first variable that fall into the first and second categories of the second variable, respectively, and N 21 and N 22 are the corresponding counts for individuals in the second category of the first variable.
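As a quick illustration, the sample OR can be computed directly from the four cell counts. This is a minimal sketch in Python rather than R, and the counts used are hypothetical:

```python
def sample_odds_ratio(n11, n12, n21, n22):
    """Sample odds ratio for a 2 x 2 table of counts.

    Rows index the first variable (e.g. male / female); columns index
    the second (e.g. diseased / not diseased).
    """
    return (n11 * n22) / (n12 * n21)

# Hypothetical counts: 12 of 20 males and 6 of 20 females have the disease.
print(sample_odds_ratio(12, 8, 6, 14))  # (12 * 14) / (8 * 6) = 3.5
```

Note that the plain formula fails if either off-diagonal count is zero; the Haldane-Anscombe correction discussed later addresses exactly this.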
However, even when there is truly no association (at a population level), we would not expect the sample OR for any set of observations to be identically one (simply because of stochasticity in the sampling). Hence, it would be useful to be able to calculate a confidence interval for the OR. However, there is no universally agreed method for calculating such a confidence interval. Here, we provide a review of some commonly used and recently suggested methods. We will focus on the conventional 95% confidence interval. Specifically, we will subject each method to evaluation by simulation of sample data sets. An effective method should produce confidence intervals that include the true (population) value of the OR on 95% of occasions. Values >95% suggest that the method is conservative and produces an excessively wide interval; values <95% suggest that the method is liberal and produces an overly narrow confidence interval. Lawson (2004) compared different methods using this criterion and the length of the confidence interval, and found that both measures gave almost identical rankings between methods. Hence, we focus on closeness to the intended error rate, as we feel this is the most valuable criterion for practitioners.
In this manuscript, we focus entirely on two-sided confidence intervals. It is also possible to construct one-sided confidence intervals, although we expect the circumstances where such are appropriate to be relatively uncommon (Ruxton & Neuhäuser 2010b; and see the study by Senn 2007 for further discussion on this). Also, the appropriate measure to use in construction of one-sided confidence intervals is considerably less controversial than the two-sided case covered here. The challenge in calculating two-tailed confidence intervals is in deciding which tail to allocate particular cases to; and this problem vanishes when there is only one tail (Lloyd & Moldovan 2007). Details of an appropriate technique for calculating one-sided confidence intervals can be found in the study by Lloyd & Moldovan (2007), and software for their calculation can be found at http://www.mbs.edu/home/lloyd/homepage/research/bcu.html.
We should also note that other measures of association are available, such as relative risks or differences in risks; however, in our survey of the analysis of contingency tables in recent behavioural ecology studies, we found no use of any measure except the OR (Ruxton & Neuhäuser 2010a), and the OR is the only measure discussed in commonly cited statistics texts aimed at biologists (e.g. Quinn & Keough 2002; Sokal & Rohlf 2005; Zar 2008). This may be because the OR is a more generally applicable measure: there are sampling scenarios where the calculation of relative risks or differences in absolute risks is not possible, but the OR remains an effective measure.
In our simulations, we consider both unconstrained and singly constrained experimental designs (sometimes called unconditioned and singly conditioned, respectively). The trapping study discussed above is an example of an unconstrained design where neither the row nor column totals of the contingency table are fixed beforehand. That is, neither the division of sampled animals into the two sexes nor into the two disease categories is fixed by the experimenter. In a singly constrained design, such a constraint is imposed on one of the variables. For example, we could imagine that experimenters might take a sample of 25 male mice and 25 female mice and subject each individual to a standard exposure to infectious material before subsequently testing to see whether they develop the disease or not. In this case, the distribution of individuals across one of the variables (sex) is fixed by the experimental design (there will be 25 animals of each sex), but the distribution of the other variable (disease status) is not. Doubly constrained designs, where the distributions of totals across both variables are fixed by the design, are possible (and indeed Fisher's Exact test was originally developed for such a situation), but they are very rarely encountered in research science (e.g. Yates 1984) and will not be considered here.
Materials and methods
In our singly constrained simulations, we condition on one of the variables: that is, we fix the total sample sizes of one of the variables. In terms of our example above, we specify the numbers of males and females included in the study. We denote these sample sizes in the two categories for the first variable by N 1 and N 2. These are the marginal totals of one variable in the contingency table. In our unconditioned simulations, we specify N as the total sample size and p c as the probability that a given individual contributes to the first category of the first variable. That is, each individual is independently and stochastically allocated to the first or second of these categories of the first variable based on this probability, with either N 1 or N 2 being incremented accordingly for each individual.
For both the unconditioned and singly conditioned cases, we then specify the probabilities associated with the two categories of the other variable: p 1 and p 2. These are the probabilities of being in the first category of the second variable for, respectively, individuals in the first and second category on the first variable. These parameter values are used to generate examples of 2 × 2 contingency tables, in concert with the values of N 1 and N 2 as defined above. Specifically, for each of N 1 experimental units (male mice in our example), we select a uniform random number between zero and one; if that number is less than p 1, we allocate the mouse to having the disease (incrementing the total number of individuals in the first category for both variables: N 11); otherwise, we allocate it to not having the disease (incrementing N 12). Similarly, for each of N 2 experimental units (female mice in our example), we select a uniform random number between zero and one; if that number is less than p 2, we allocate the mouse to having the disease (incrementing N 21); otherwise, we allocate it to not having the disease (incrementing N 22). The true population value of the OR is given by

OR = [p 1/(1 − p 1)] / [p 2/(1 − p 2)]
For each of 10 000 replicate contingency tables generated as described above (with identical values of N 1, N 2, p 1 & p 2 for the singly constrained case; N, p c , p 1 & p 2 for the unconstrained case), we generate a confidence interval for OR by a given specified method, then evaluate the fraction of such confidence intervals that contain the true value given above. We evaluate this for a number of different combinations of N 1, N 2, p 1 & p 2 or N, p c , p 1 & p 2.
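The table-generation and coverage-evaluation procedure for the singly constrained case can be sketched as follows. This is an illustrative Python sketch (the study itself used R); the confidence-interval method is passed in as a function so that any of the methods reviewed below could be substituted, and all parameter values are hypothetical:

```python
import random

def simulate_table(n1, n2, p1, p2, rng):
    """One singly constrained 2 x 2 table: row totals n1 and n2 are fixed,
    and each unit falls into the first disease category with probability
    p1 (first row) or p2 (second row)."""
    n11 = sum(rng.random() < p1 for _ in range(n1))
    n21 = sum(rng.random() < p2 for _ in range(n2))
    return n11, n1 - n11, n21, n2 - n21

def coverage(ci_method, n1, n2, p1, p2, reps=10_000, seed=1):
    """Fraction of intervals from ci_method(n11, n12, n21, n22) that
    contain the true population odds ratio."""
    rng = random.Random(seed)
    true_or = (p1 / (1 - p1)) / (p2 / (1 - p2))
    hits = 0
    for _ in range(reps):
        lo, hi = ci_method(*simulate_table(n1, n2, p1, p2, rng))
        hits += lo <= true_or <= hi
    return hits / reps
```

An effective 95% method should return a coverage value close to 0·95; values above indicate conservatism, values below liberality.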
Our results for the singly constrained case are given in Table 1; those for the unconstrained case in Table 2.
Perhaps the most obvious way for most biologists to construct a 95% confidence interval for the OR is to observe that the base stats package of the statistical software R provides a function fisher.test which (as well as performing Fisher's Exact test) also produces an estimate of the confidence interval for the OR, using a method originally recommended by Fisher (1962). Our evaluation of the performance of this method is shown in Tables 1 and 2. We can see that the method is consistently conservative: it produces overly wide confidence intervals that enclose considerably more than the nominal 95% of true population ORs. Although this effect decreases very slightly with increasing sample size, it persists even for relatively large sample sizes.
Table 1. The singly constrained case. For each of 10 000 replicates (with identical values of N 1, N 2, p 1 & p 2), we generate a confidence interval for the odds ratio by the specified method, then evaluate the fraction of such confidence intervals that contain the true value.
Table 2. The unconstrained case. For each of 10 000 replicates (with identical values of N, p c , p 1 & p 2), we generate a confidence interval for the odds ratio by the specified method, then evaluate the fraction of such confidence intervals that contain the true value.
The R module epitools offers four alternatives for calculating the confidence interval through its oddsratio function. We do not present a full evaluation of two of these methods, as they failed to produce a confidence interval if one or more cells in the contingency table contained a zero. These were OR calculated by median-unbiased estimation with the confidence interval calculated by exact methods, and OR calculated by unconditional maximum likelihood with the confidence interval generated through a normal approximation. We did explore the other two methods available through this function: ‘epi-small’ refers to OR estimation by small-sample adjustment with the confidence interval around it estimated by normal approximation with small-sample adjustment, and ‘epi-fisher’ involves estimation of the OR by conditional maximum likelihood with confidence intervals calculated by exact methods. Full details of these methods can be found in the study by Jewell (2004). Epi-small generally performed worse than fisher.test in terms of coverage of the true value in our simulations (estimated coverage was closer to 0·95 for epi-small than for fisher.test on only 5 of the 21 scenarios in Table 1 and 8 of the 24 cases in Table 2). The second method (epi-fisher) was generally (but not always) slightly better than fisher.test (coverage closer to 0·95 on 19 of 21 cases in Table 1 and 23 of 24 cases in Table 2). Like fisher.test, both epi-fisher and epi-small were always conservative.
We next evaluated Woolf's method with Haldane-Anscombe correction, which was one of the methods recommended by Lawson (2004) after evaluation of 10 alternative methods. This method is available in R through the Prop.or function of the pairwiseCI module and is described fully by Lawson (2004). The results in Tables 1 and 2 suggest that it performs generally very similarly to fisher.test and was inferior to epi-fisher (further from 0·95 on 19 of 21 cases in Table 1 and 23 of 24 cases in Table 2).
Fay (2010) presents three different ways to calculate confidence intervals for the OR. In the first method, termed minlike, Fisher's Exact test is calculated such that the P-value is the sum of the probabilities of all tables as probable as or less probable than the observed table under the null hypothesis. In the second, termed central, the P-value is defined as the minimum of one and twice the smaller of the one-sided P-values. Finally, in a method termed Blaker after Blaker (2000), the P-value is the sum of the observed one-sided tail probability and the largest tail probability on the other side that is not larger than the observed one. In each case, the confidence interval is obtained by inverting the test. In the first and third cases, inversion need not result in a single continuous interval, hence Fay (2010) provides a new algorithm to obtain such an interval. These three methods are implemented in the R module exact2x2. Tables 1 and 2 suggest that all three methods are always conservative. Minlike seems to perform very similarly to the best-performing method considered so far (epi-fisher: closer to 0·95 on 10 of 21 cases in Table 1 and 12 of 24 cases in Table 2). Minlike appears to perform better than central (closer to 0·95 on 20/21 cases in Table 1 and 21/24 in Table 2), but similarly to Blaker (closer to 0·95 on 13/21 occasions in Table 1 and 12/24 occasions in Table 2).
All the methods described so far use the traditional approach to exact small-sample interval estimation where the conditional distribution is used to eliminate nuisance parameters, and there are theoretical reasons why this necessarily leads to conservative estimation (Agresti & Min 2002; Lin & Yang 2006). Agresti & Min (2002) and Agresti (2003) implement an unconditional approach based on the score statistic, which should be less conservative (Agresti 1999). This method is available through the function orscoreci in the R package PropCIs. Our evaluation in Tables 1 and 2 suggests that this method is never liberal but remains closer to the nominal 95% level than any of the other methods (Table 1 suggests that it was closer to 0·95 than all of minlike, Blaker and epi-fisher on 21/21 occasions; Table 2 suggests that it was closer to 0·95 than all three on at least 22/24 of the scenarios considered).
Our analysis allows us to offer some tentative advice about construction of a confidence interval for the OR. Of all of the methods currently available, the unconditional approach based on the score statistic (introduced by Agresti & Min 2002 and available in the PropCIs module of R) remains consistently closest to the nominal 95% level (and is always conservative when it differs from 95%), and this is the method we would generally recommend (in both the unconstrained and singly constrained cases). Our advice is tentative because it is based only on a small amount of simulation data, and there is no guarantee that our simulated situations are a good guide to all biologically encountered ones. Also, we consider only 95% confidence, and using such an arbitrary confidence threshold is particularly troublesome for discrete data, where only a finite number of distinct tables are possible and achieving exactly 95% confidence often involves extrapolation from the discrete probabilities that do exist for a given situation.
Some scientists may prefer either the minlike or Blaker methods of Fay (2010), implemented in the R module exact2x2. They have the advantage that, if the confidence interval is presented along with the results of a test of the null hypothesis (using the method appropriate to the technique, as defined above: Fisher's Exact test for minlike or the Exact test of Blaker (2000) for Blaker), then the two will normally be in agreement. That is, if the null hypothesis test gives a P-value <0·05, then on a high percentage of occasions the associated confidence interval will not contain the value ‘one’, whereas it normally will contain ‘one’ when the P-value is above 0·05. As discussed by Fay (2010), this congruence between the null hypothesis testing and effect-size approaches cannot be guaranteed for these methods, but our simulations (not shown) suggest congruence for 94% of our generated tables for minlike and 97% for Blaker, and this was higher than for any other method. Hence, we would also recommend the minlike or Blaker approaches for scientists who intend to present both the confidence interval for the OR and a P-value from testing the null hypothesis of no association using Fisher's Exact test or Blaker's Exact test. However, both Lydersen, Fagerland & Laake (2009) and Ruxton & Neuhäuser (2010a) recommend the unconditional Fisher-Boschloo test (implemented in the R package Exact) over Fisher's Exact test and other exact tests, because of its greater power. If this test is used to test the null hypothesis of no association, we find that it gives 98·3% congruence with the unconditional approach based on the score statistic but only 87% and 89% congruence with the minlike and Blaker methods, respectively. Hence, our recommended method to be used alongside null hypothesis testing depends on the method used to test the null hypothesis.
We note that all our recommended methods are relatively complex to implement and may not be attractive to those who do not use R. In this case, we recommend the Woolf method. As our results in Tables 1 and 2 demonstrate, this method offers respectable, consistently conservative performance, and it can be applied by any scientist in a matter of minutes with a pocket calculator. Specifically, in its original form due to Woolf (1955), the 1 − α confidence interval is given by

exp(ln L − z α S) < OR < exp(ln L + z α S)

where z α is the standard normal score for nominal significance level α (1·96 for a 95% interval), L is the sample estimate of the OR as defined in the introduction and

S = √(1/N 11 + 1/N 12 + 1/N 21 + 1/N 22)
The correction due to Haldane (1940) and Anscombe (1956) is simply to add 0·5 to all the cells prior to the calculations.
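For readers wishing to check a hand calculation, the Woolf interval with Haldane-Anscombe correction can be sketched as follows. This is an illustrative Python implementation of the formulas above (the counts in the comment are hypothetical; 1·96 is the z-score for a 95% interval):

```python
import math

def woolf_ci(n11, n12, n21, n22, z=1.96):
    """Woolf confidence interval for the odds ratio, with the
    Haldane-Anscombe correction (0.5 added to every cell).

    Returns (lower, upper) for the nominal 100*(1 - alpha)% interval,
    where z is the standard normal score for that alpha (1.96 -> 95%).
    """
    a, b, c, d = (x + 0.5 for x in (n11, n12, n21, n22))
    log_or = math.log((a * d) / (b * c))          # ln L, corrected
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # S in the text
    return math.exp(log_or - z * se), math.exp(log_or + z * se)

# Hypothetical counts from the mouse example; the correction also
# keeps the interval finite when a cell count is zero:
lo, hi = woolf_ci(12, 8, 6, 14)
lo0, hi0 = woolf_ci(12, 8, 0, 20)
```

The correction slightly shrinks the point estimate towards one, but its main practical benefit is that the interval remains defined when a cell of the table is empty.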
No matter which method a researcher selects, we hope that the simulations in our study help them to interpret their own results to best effect, especially with regard to the issue of conservatism of tests discussed above.
We thank Stephen Senn and two other anonymous reviewers for thoughtful and useful comments on earlier versions.