## Introduction

A common situation in biology is where we have count data and wish to explore whether there is an association between two unordered categorical variables. An example of this might be exploring whether the sex (male or female) of wild mice caught in a trap is related to their infection status with respect to some disease (infected or not). We will particularly focus on the situation described in this example where each variable has only two levels. Such count data can be organised into a 2 × 2 contingency table, and it is conventional to test for association between the two variables using either a Chi-squared test, G-test or Fisher's Exact Test. This approach has recently been reviewed by both Lydersen, Fagerland & Laake (2009) and Ruxton & Neuhäuser (2010a).

An alternative approach to such null hypothesis statistical testing is calculation of the effect size: in biology, the most commonly used measure of association in such cases is the *odds ratio (OR)*. For the example above, the odds of a female having the disease are simply the probability of a captured female having the disease divided by the probability of such a female not having the disease. Analogously, the odds of a male having the disease are the probability of a male having the disease divided by the probability of a male not having the disease. The OR is simply the quotient of these two odds (here, the odds for females divided by the odds for males). An OR of one is equivalent to no association (in our example, no difference between males and females in likelihood of having the disease). The sample OR is easily calculated for any 2 × 2 contingency table of counts: if the cells of the table contain counts *a* (infected females), *b* (uninfected females), *c* (infected males) and *d* (uninfected males), then OR = (*a*/*b*)/(*c*/*d*) = (*a* × *d*)/(*b* × *c*).
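For concreteness, the calculation can be sketched in a few lines of code; the cell labels and counts below are hypothetical, chosen only to illustrate the arithmetic:

```python
def sample_odds_ratio(a, b, c, d):
    """Sample OR for a 2 x 2 table: (a/b) / (c/d) = (a*d) / (b*c).

    Hypothetical cell labels: a = infected females, b = uninfected
    females, c = infected males, d = uninfected males.
    """
    return (a * d) / (b * c)

# Hypothetical trap counts: 12 of 30 females and 8 of 32 males infected.
or_hat = sample_odds_ratio(12, 18, 8, 24)  # (12/18) / (8/24) = 2.0
```

With these illustrative counts, the odds of infection for a female are (12/18) ≈ 0.67 and for a male (8/24) ≈ 0.33, giving a sample OR of 2.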

However, even when there is truly no association (at a population level), we would not expect the sample OR for any set of observations to be identically one (simply because of stochasticity in the sampling). Hence, it would be useful to be able to calculate a confidence interval for the OR. Unfortunately, there is no universally agreed method for calculating such a confidence interval. Here, we provide a review of some commonly used and recently suggested methods. We will focus on the conventional 95% confidence interval. Specifically, we will subject each method to evaluation by simulation of sample data sets. An effective method should produce confidence intervals that include the true (population) value of the OR on 95% of occasions. Values >95% suggest that the method is conservative and produces an excessively wide interval; values <95% suggest that the method is liberal and produces an overly narrow interval. Lawson (2004) compared different methods using this criterion and the length of the confidence interval, and found that both measures gave almost identical rankings between methods. Hence, we focus on closeness to the intended error rate, as we feel this is the most valuable criterion for practitioners.
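The evaluation idea can be sketched as follows. Purely as a stand-in, we use the well-known Woolf (logit) interval with the Haldane–Anscombe 0.5 correction; it is one of many candidate methods, not a recommendation, and the group sizes, infection probabilities and replicate count below are arbitrary illustrations:

```python
import math
import random

def woolf_ci(a, b, c, d, z=1.959964):
    """Woolf (logit) 95% interval for the OR, with the Haldane-Anscombe
    0.5 correction so that zero cells do not break the logarithm.
    Used here only to illustrate the coverage-simulation idea."""
    a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.exp(log_or - z * se), math.exp(log_or + z * se)

def coverage(n_per_group=25, p1=0.4, p2=0.4, reps=2000, seed=1):
    """Estimate how often the interval contains the true OR, under a
    singly constrained design (fixed group sizes, Bernoulli disease
    status). All parameter values are arbitrary illustrations."""
    rng = random.Random(seed)
    true_or = (p1 / (1 - p1)) / (p2 / (1 - p2))
    hits = 0
    for _ in range(reps):
        a = sum(rng.random() < p1 for _ in range(n_per_group))
        c = sum(rng.random() < p2 for _ in range(n_per_group))
        lo, hi = woolf_ci(a, n_per_group - a, c, n_per_group - c)
        hits += lo <= true_or <= hi
    return hits / reps
```

An effective method should return an estimated coverage close to 0.95 here; materially higher values indicate a conservative method, lower values a liberal one.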

In this manuscript, we focus entirely on two-sided confidence intervals. It is also possible to construct one-sided confidence intervals, although we expect the circumstances where these are appropriate to be relatively uncommon (Ruxton & Neuhäuser 2010b; and see Senn 2007 for further discussion). Moreover, the appropriate method to use in constructing one-sided confidence intervals is considerably less controversial than in the two-sided case covered here. The challenge in calculating two-sided confidence intervals is in deciding which tail to allocate particular cases to; this problem vanishes when there is only one tail (Lloyd & Moldovan 2007). Details of an appropriate technique for calculating one-sided confidence intervals can be found in Lloyd & Moldovan (2007), and software for their calculation is available at http://www.mbs.edu/home/lloyd/homepage/research/bcu.html.

We should also note that other measures of association are available, such as relative risks or differences in risks; however, in our survey of the analysis of contingency tables in recent behavioural ecology studies, we found no use of any measure except the OR (Ruxton & Neuhäuser 2010a), and the OR is the only measure discussed in commonly cited statistics texts aimed at biologists (e.g. Quinn & Keough 2002; Sokal & Rohlf 2005; Zar 2008). This may be because the OR is more generally applicable: there are sampling scenarios (such as case-control sampling) where the calculation of relative risks or differences in absolute risks is not possible, but the OR remains an effective measure.
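To make the distinction among these measures concrete, the following sketch computes all three from the same (hypothetical) table; the cell labels are our own:

```python
def association_measures(a, b, c, d):
    """Three measures of association for a 2 x 2 table of counts.

    Hypothetical labels: a, b = diseased/healthy in group 1;
    c, d = diseased/healthy in group 2. Note that the risks (and
    hence relative risk and risk difference) are only meaningful
    when rows represent samples from the two groups; the OR can
    also be estimated under other sampling schemes.
    """
    risk1, risk2 = a / (a + b), c / (c + d)
    return {
        "odds_ratio": (a / b) / (c / d),
        "relative_risk": risk1 / risk2,
        "risk_difference": risk1 - risk2,
    }

m = association_measures(12, 18, 8, 24)
```

With these counts the risks are 0.40 and 0.25, so the relative risk is 1.6 and the risk difference 0.15, while the OR is 2.0: the three measures summarise the same table quite differently.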

In our simulations, we consider both unconstrained and singly constrained experimental designs (sometimes called unconditioned and singly conditioned, respectively). The trapping study discussed above is an example of an unconstrained design, where neither the row nor column totals of the contingency table are fixed beforehand. That is, neither the division of sampled animals into the two sexes nor into the two disease categories is fixed by the experimenter. In a singly constrained design, such a constraint is imposed on one of the variables. For example, we could imagine that experimenters might take a sample of 25 male mice and 25 female mice and subject each individual to a standard exposure to infectious material before subsequently testing to see whether it develops the disease or not. In this case, the distribution of individuals across one of the variables (sex) is fixed by the experimental design (there will be 25 animals of each sex), but the distribution of the other variable (disease status) is not. Doubly constrained designs, where the distributions of totals across both variables are fixed in the design, are possible (and indeed Fisher's Exact Test was originally developed for such a situation), but they are very rarely encountered in research science (e.g. Yates 1984) and will not be considered here.
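The two sampling designs can be mimicked in simulation as follows; the group sizes, cell probabilities and infection risks below are arbitrary illustrations:

```python
import random

def unconstrained_table(n_total, cell_probs, rng):
    """Unconstrained design: only the total sample size is fixed; each
    animal falls into one of the four cells (e.g. female-infected,
    female-uninfected, male-infected, male-uninfected) with the given
    probabilities -- a multinomial draw, done cell by cell."""
    cum, running = [], 0.0
    for p in cell_probs:
        running += p
        cum.append(running)
    counts = [0] * len(cell_probs)
    for _ in range(n_total):
        u = rng.random()
        for i, threshold in enumerate(cum):
            if u <= threshold:
                counts[i] += 1
                break
        else:
            counts[-1] += 1  # guard against floating-point round-off
    return counts

def singly_constrained_table(n_per_group, p1, p2, rng):
    """Singly constrained design: the experimenter fixes the number of
    animals of each sex; only disease status is random (two
    independent binomial draws)."""
    a = sum(rng.random() < p1 for _ in range(n_per_group))
    c = sum(rng.random() < p2 for _ in range(n_per_group))
    return [a, n_per_group - a, c, n_per_group - c]

rng = random.Random(2024)
table_u = unconstrained_table(100, [0.25, 0.25, 0.25, 0.25], rng)
table_c = singly_constrained_table(25, 0.3, 0.3, rng)
```

In the unconstrained case every cell count (and hence every margin) is random; in the singly constrained case the row totals are fixed at 25 by construction, which is exactly the distinction the simulations need to respect.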