1. Correlations between phenotypic traits are important in a number of contexts in physiological ecology, evolutionary physiology, and behaviour. Correlations can reflect functional connections or trade-offs among performance traits (e.g. bite force, jumping distance) and can reveal causal relationships between whole-organism traits and lower-level biochemical or morphological traits.
2. However, when one or both traits exhibit intraindividual variability (i.e. repeatability < 1), conventional estimates of Pearson product-moment correlation coefficients are biased towards zero (= attenuated). The magnitude of this bias decreases with increases in the number of measurements used to calculate the mean value of the trait for each individual. The bias varies inversely with the repeatability of each trait.
3. We present an estimator for the correlation coefficient that eliminates this bias. This estimator is based on an equation originally presented in 1904 by Spearman, and applied by researchers in psychological testing and nutritional epidemiology. The estimator is a simple function of the within- and among-individual components of variance for each of the two traits.
4. Simulations show that optimal sampling effort usually involves a small number of trials per individual and a large sample of individuals (for a fixed total sample size), although correlations between traits with low repeatabilities may be more precisely estimated with a larger number of trials per individual and a smaller number of individuals.
5. In addition to reducing the accuracy of the estimate, attenuation also reduces statistical power for detecting significant correlations. However, we do not recommend using the unbiased estimator for testing whether correlations differ from zero, because this inflates Type I error rates. Instead, the uncorrected (conventional) estimator should be used for hypothesis testing.
6. The unbiased estimator is not appropriate for correlations involving maximum or minimum values for each individual (e.g. maximum sprint speed) because sampling distributions of these extreme values typically have different properties than the sampling distributions of individual mean values.
Here, we alert researchers to a statistical problem that is not usually recognized in integrative biology: conventional estimates of Pearson product-moment correlations are biased towards zero whenever there is intraindividual variation in one or both of the traits involved (Fuller 1987). Virtually all behavioural and physiological traits show some degree of intraindividual variability (i.e. repeatability < 1·0, e.g. Bennett 1987; Huey & Dunham 1987; Boake 1989; Hayes & Jenkins 1997; Brodie & Russell 1999; Dohm 2002). Therefore, the values of phenotypic correlations reported in the biological literature are likely to underestimate the true correlations, on average.
We present an estimator that gives a theoretically unbiased estimate of the correlation coefficient between individual trait means. In addition, we use simulations to evaluate the performance of corrected vs. uncorrected estimates of correlation coefficients and to explore the allocation of sampling effort between the number of individuals vs. the number of measurements per individual.
Bias in estimating correlation coefficients
Intraindividual variability is statistically equivalent to measurement error. Statisticians have long recognized that estimates of Pearson product-moment correlation coefficients are biased when there is measurement error in either the X or the Y variable (Spearman 1904; Thouless 1939; Fuller 1987). This bias is called ‘attenuation’ (Spearman 1904) because on average an estimate of a correlation coefficient is biased towards zero. Attenuation of correlation coefficients is well-known in some fields, particularly in psychological and educational testing (Gulliksen 1950) and nutritional epidemiology (Willett 1998). However, researchers rarely acknowledge or correct for attenuation in many fields of biology, including physiological ecology, evolutionary physiology, and animal behaviour. A related phenomenon, the attenuation of regression slopes due to measurement error in the x variable, has been addressed by several papers in these fields (e.g. McArdle 2003).
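Spearman (1904) showed that the expected value of a sample correlation coefficient between two traits X and Y is given by
$$E[r] = \rho \sqrt{\frac{\sigma^2_{aX}}{\sigma^2_{aX} + \sigma^2_{wX}} \cdot \frac{\sigma^2_{aY}}{\sigma^2_{aY} + \sigma^2_{wY}}}$$ (eqn 1)
where ρ = the true correlation coefficient between the mean values of Traits X and Y for each individual, $\sigma^2_{aX}$ = the among-individual variance in Trait X and $\sigma^2_{wX}$ = the within-individual variance in Trait X, with corresponding notation for Trait Y. Now suppose that we make nX measurements per individual for Trait X, and nY measurements for Trait Y, and use the mean value of each individual's trials as our x,y pairs. The individual mean values have error variances $\sigma^2_{wX}/n_X$ and $\sigma^2_{wY}/n_Y$, so that the expected value of a sample correlation coefficient is
$$E[r] = \rho \sqrt{\frac{\sigma^2_{aX}}{\sigma^2_{aX} + \sigma^2_{wX}/n_X} \cdot \frac{\sigma^2_{aY}}{\sigma^2_{aY} + \sigma^2_{wY}/n_Y}}$$ (eqn 2)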
This formula or its equivalent has been presented previously by several authors in other fields (Liu et al. 1978; Beaton et al. 1979; Rosner & Willett 1988). Note that the magnitude of the multiplicative bias (the square root term) is independent of ρ and of the number of individuals sampled.
Equations 1 and 2 show that if within-individual variance is present, the estimated correlation coefficient will be biased towards zero. The magnitude of the bias decreases as the ratio of within- to among-individual variance ($\sigma^2_w/\sigma^2_a$) decreases (i.e. as repeatability increases) and as the per-individual sample sizes nX and nY increase (Fig. 1).
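As a rough numerical illustration of eqns 1 and 2, the following Python sketch evaluates the multiplicative bias for a hypothetical pair of traits. The function name and the example variance components are assumptions chosen for illustration; they do not come from the study.

```python
import math

def attenuation_factor(var_among_x, var_within_x, n_x,
                       var_among_y, var_within_y, n_y):
    """Square-root term of eqn 2 (reduces to eqn 1 when n_x = n_y = 1)."""
    term_x = var_among_x / (var_among_x + var_within_x / n_x)
    term_y = var_among_y / (var_among_y + var_within_y / n_y)
    return math.sqrt(term_x * term_y)

# Hypothetical example: both traits have repeatability 0.5
# (among-individual variance = within-individual variance = 1.0)
# and the true correlation is 0.8.
rho = 0.8
for n_trials in (1, 2, 5, 10):
    factor = attenuation_factor(1.0, 1.0, n_trials, 1.0, 1.0, n_trials)
    print(n_trials, round(factor, 3), round(rho * factor, 3))
```

Under these assumed values a single measurement per individual halves the expected correlation, and the factor approaches 1 as the number of trials grows.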
Equation 2 can be used to derive an unbiased estimator for the correlation coefficient:
$$\hat{r}_{\mathrm{corrected}} = r \sqrt{\left(1 + \frac{s^2_{wX}}{n_X s^2_{aX}}\right)\left(1 + \frac{s^2_{wY}}{n_Y s^2_{aY}}\right)}$$ (eqn 3)
where r is the uncorrected correlation coefficient calculated using the sample mean values of each trait for each individual (x̄ and ȳ). We have replaced the parametric variance terms (σ²) with sample variance terms (s²) in eqn 3 because these variances will almost always be estimated from data. It is possible for a given sample to yield a value of r̂corrected that exceeds 1·0 (or falls below −1·0); in this case, the estimate should be set to 1·0 (or −1·0).
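The estimator in eqn 3 might be computed from raw data along the following lines. This is a minimal Python sketch under stated assumptions: the design is balanced (every individual has the same number of trials for a given trait), and the variance components are estimated by one-way ANOVA mean squares, a standard choice that is assumed here rather than taken from the text. All function names and the example data are hypothetical.

```python
import math
from statistics import mean

def variance_components(trials):
    """One-way ANOVA estimates (s2_among, s2_within) for a balanced design.

    trials: list of per-individual lists, each with the same n >= 2 values.
    """
    n_ind, n = len(trials), len(trials[0])
    ind_means = [mean(t) for t in trials]
    grand = mean(ind_means)
    ms_among = n * sum((m - grand) ** 2 for m in ind_means) / (n_ind - 1)
    ss_within = sum((v - m) ** 2 for t, m in zip(trials, ind_means) for v in t)
    ms_within = ss_within / (n_ind * (n - 1))
    return (ms_among - ms_within) / n, ms_within

def pearson(xs, ys):
    """Conventional Pearson product-moment correlation."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def corrected_correlation(x_trials, y_trials):
    """Return (r, r_corrected), where r_corrected follows eqn 3."""
    r = pearson([mean(t) for t in x_trials], [mean(t) for t in y_trials])
    factor = 1.0
    for trials in (x_trials, y_trials):
        s2_a, s2_w = variance_components(trials)
        if s2_a <= 0:
            raise ValueError("among-individual variance estimate is not positive")
        factor *= 1.0 + s2_w / (len(trials[0]) * s2_a)
    r_corr = r * math.sqrt(factor)
    # Estimates beyond +/-1 are set to the boundary, as described in the text.
    return r, max(-1.0, min(1.0, r_corr))

# Made-up data: four individuals, two trials per trait.
x = [[1.0, 3.0], [4.0, 6.0], [7.0, 9.0], [10.0, 12.0]]
y = [[2.0, 4.0], [3.0, 5.0], [9.0, 11.0], [8.0, 10.0]]
print(corrected_correlation(x, y))
```

The sketch raises an error when the estimated among-individual variance is not positive, because eqn 3 is undefined in that case; how to handle such samples is a design choice that the text does not address.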
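Because the repeatability (= intraclass correlation coefficient; Sokal & Rohlf 1994) for Trait X is given by $r_{iX} = s^2_{aX}/(s^2_{aX} + s^2_{wX})$, eqn 3 can be rewritten in terms of the repeatabilities of the two traits:
$$\hat{r}_{\mathrm{corrected}} = r \sqrt{\left(1 + \frac{1 - r_{iX}}{n_X r_{iX}}\right)\left(1 + \frac{1 - r_{iY}}{n_Y r_{iY}}\right)}$$ (eqn 4)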
Equation 4 can be used to obtain unbiased estimates of correlations from previously published work if the repeatabilities are known, and also demonstrates that using the mean of several measurements is comparable with increasing repeatability (Falconer 1981; Arnold 1994; Hayes & Jenkins 1997).
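As an illustration of how eqn 4 might be applied to a previously published correlation, the short Python sketch below takes a reported r, the repeatability of each trait, and the number of measurements averaged per individual. The function name and the example numbers are hypothetical, and the repeatabilities are assumed to refer to single measurements.

```python
import math

def correct_published_r(r, rep_x, rep_y, n_x=1, n_y=1):
    """Disattenuate a reported correlation using eqn 4.

    rep_x, rep_y: single-measurement repeatabilities of the two traits.
    n_x, n_y: number of measurements averaged per individual.
    """
    for rep in (rep_x, rep_y):
        if not 0.0 < rep <= 1.0:
            raise ValueError("repeatability must lie in (0, 1]")
    factor_x = 1.0 + (1.0 - rep_x) / (n_x * rep_x)
    factor_y = 1.0 + (1.0 - rep_y) / (n_y * rep_y)
    r_corr = r * math.sqrt(factor_x * factor_y)
    # Values beyond +/-1 are set to the boundary, as for eqn 3.
    return max(-1.0, min(1.0, r_corr))

# Hypothetical example: reported r = 0.30 from single measurements of
# two traits that each have repeatability 0.5.
print(correct_published_r(0.30, 0.5, 0.5))
```

In this made-up case the correction doubles the reported value, from 0·30 to 0·60; averaging more measurements per individual (larger n_x and n_y) shrinks the correction.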
We ran computer simulations to evaluate the performance of the unbiased estimator (eqn 3). We drew samples from a bivariate normal distribution with a known true correlation coefficient (ρ = 0·4 and 0·8) and examined the effects of number of individuals (Nsubjects = 20, 40 and 80), the number of measurements per individual (ntrials = 2, 5 and 10), and repeatability (ri = 0·2, 0·5 and 0·8). We used the same values of ntrials and ri for Traits X and Y. We added to each variate a random, normally distributed error term adjusted to yield the desired repeatability. For each parameter combination we drew 5000 independent samples, and from each sample we calculated four different estimates of the correlation coefficient.
1. Uncorrected for bias, but without adding error terms (i.e. $\sigma^2_w = 0$), as a check on the simulation procedure. This should yield an unbiased estimate, on average.
2. Uncorrected for bias. On average, this should yield a biased estimate (eqn 2).
3. Corrected for bias using eqn 3, using the known parametric variance components. This should yield an unbiased estimate.
4. Corrected for bias using eqn 3, with variance components estimated from the sample. This should yield an unbiased estimate.
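The simulation design described above can be sketched as follows. The original simulations were run in R; this Python version is only an illustrative approximation, not the code used in the study. It covers estimates 1 to 3 (estimate 4 would additionally require estimating the variance components from each sample), fixes the among-individual variance at 1, and does not truncate corrected values at ±1. The function names, seed and number of replicates are assumptions.

```python
import math
import random

def pearson(xs, ys):
    """Conventional Pearson product-moment correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def simulate(rho, repeatability, n_subjects, n_trials, n_reps, seed=1):
    """Return the mean of estimates 1, 2 and 3 over n_reps samples."""
    rng = random.Random(seed)
    # With among-individual variance 1, this within-individual variance
    # yields the requested repeatability.
    var_within = (1.0 - repeatability) / repeatability
    sd_within = math.sqrt(var_within)
    # Correction term of eqn 3; identical for X and Y here, so the
    # square root of the product equals the term itself.
    factor = 1.0 + var_within / n_trials
    est1, est2, est3 = [], [], []
    for _ in range(n_reps):
        true_x, true_y, mean_x, mean_y = [], [], [], []
        for _ in range(n_subjects):
            z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
            x = z1
            y = rho * z1 + math.sqrt(1.0 - rho ** 2) * z2
            true_x.append(x)
            true_y.append(y)
            # Individual means of n_trials error-laden measurements.
            mean_x.append(sum(x + rng.gauss(0.0, sd_within)
                              for _ in range(n_trials)) / n_trials)
            mean_y.append(sum(y + rng.gauss(0.0, sd_within)
                              for _ in range(n_trials)) / n_trials)
        r = pearson(mean_x, mean_y)
        est1.append(pearson(true_x, true_y))  # no error terms added
        est2.append(r)                        # uncorrected
        est3.append(r * factor)               # corrected, known variances
    return tuple(sum(e) / n_reps for e in (est1, est2, est3))

print(simulate(rho=0.8, repeatability=0.5, n_subjects=40,
               n_trials=5, n_reps=200))
```

For these assumed settings eqn 2 predicts a mean uncorrected correlation of about 0·8/1·2 ≈ 0·67, while the other two means should lie close to 0·8.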
Simulations were run in the statistical computing language R (R Development Core Team 2005) and yielded several noteworthy results. First, the uncorrected (conventional) correlation coefficients were biased towards zero as predicted by theory, whereas the corrected correlation coefficients on average were unbiased (Figs 2 and 3). In each case the mean correlations were essentially identical to the analytical predictions from eqns 2 and 3. Second, the sampling distributions differed in variance (Fig. 2). The variance of the distribution with $\sigma^2_w = 0$ was entirely due to sampling of individuals from the population. The other three distributions exhibited variance due to both sampling of individuals and sampling of values within individuals. The greater sampling variance of the unbiased estimators compared with the conventional estimator arises because the unbiased estimate involves multiplication by a scalar (the correction factor); recall that Var(aX) = a² Var(X) for any random variable X and scalar a. The distributions of the two unbiased estimators were virtually identical, indicating that estimating the variance components in eqn 3 added little variance to the total sampling variance.
Overall, the simulations confirmed that the uncorrected estimator is biased, and that the corrected estimator eliminated this bias (Fig. 3). The simulations also illustrated the effects of sample size and repeatability on bias: the bias was greatest for the smallest number of samples per individual (ntrials = 2) and for the lowest repeatabilities (ri = 0·2). In these cases, almost the entire distribution of sample correlation coefficients fell below the true value of ρ, clearly an undesirable property for an estimator. Another key result is that using a larger sample size of individuals (Nsubjects) did not reduce bias, although it did reduce the sampling variances of both the unbiased and biased estimators; recall that Nsubjects does not appear in eqn 2. The sampling variance of the unbiased estimator was greater than that of the biased (uncorrected) estimator, as discussed above for Fig. 2. However, this difference decreased with increases in repeatability and the number of trials per individual. Our results are consistent with the conclusions of Fan (2003), who conducted a similar study to address the issue of measurement reliability and its effect on correlations between test scores in educational and psychological assessment.
Hypothesis testing: which estimator to use?
At first glance it might seem preferable to use the more accurate unbiased estimator (eqn 3) for testing the null hypothesis ρ = 0. However, it is more appropriate to use the uncorrected (conventional) estimate for this hypothesis test, for the following reason: if ρ = 0, then there is no bias, and therefore no need to correct the estimate. The tables of critical values for the correlation coefficient were established for the uncorrected estimator; different tables would need to be constructed for the unbiased estimator. Using an unbiased estimate of ρ for hypothesis testing with the conventional critical values would inflate the Type I error rate beyond the desired rate, because the estimate would exceed the critical value more frequently. We verified this effect by running simulations in which ρ = 0 (results not shown). This suggests that researchers should first use the conventional estimator to test whether ρ = 0. Then, if H0 is rejected (i.e. ρ ≠ 0), eqn 3 should be used for an unbiased estimate of ρ.
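The inflation of Type I error described above can be illustrated with a small simulation. The Python sketch below is not the authors' simulation (their results are not shown); it assumes 40 individuals, known variance components, and a tabulated two-tailed 5% critical value of t for 38 degrees of freedom. Because ρ = 0, the individual means of the two traits are drawn directly as independent normal variates, which is a shortcut rather than a trial-by-trial simulation. All names are hypothetical.

```python
import math
import random

N_SUBJECTS = 40
T_CRIT = 2.024  # two-tailed 5% critical t for 38 d.f. (standard tables)

def pearson(xs, ys):
    """Conventional Pearson product-moment correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def type1_error_rates(repeatability, n_trials, n_reps=2000, seed=1):
    """Rejection rates under rho = 0 for uncorrected and corrected r."""
    rng = random.Random(seed)
    var_within = (1.0 - repeatability) / repeatability
    # SD of an individual mean: among-individual variance 1 plus the
    # error variance of a mean of n_trials measurements.
    sd_mean = math.sqrt(1.0 + var_within / n_trials)
    factor = 1.0 + var_within / n_trials  # eqn 3, same for both traits
    # Critical |r| implied by the t test with N - 2 degrees of freedom.
    r_crit = T_CRIT / math.sqrt(T_CRIT ** 2 + N_SUBJECTS - 2)
    reject_conv = reject_corr = 0
    for _ in range(n_reps):
        xs = [rng.gauss(0.0, sd_mean) for _ in range(N_SUBJECTS)]
        ys = [rng.gauss(0.0, sd_mean) for _ in range(N_SUBJECTS)]
        r = pearson(xs, ys)
        reject_conv += abs(r) > r_crit
        reject_corr += abs(r * factor) > r_crit
    return reject_conv / n_reps, reject_corr / n_reps

print(type1_error_rates(repeatability=0.5, n_trials=2))
```

Because the correction factor is never smaller than 1, the corrected estimate exceeds the conventional critical value at least as often as the uncorrected one, so its rejection rate under the null hypothesis cannot be lower.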
Much less frequently, researchers wish to test null hypotheses involving nonzero values of ρ. In this case it seems reasonable to use the unbiased estimator. Rosner & Willett (1988) and Charles (2005) give formulas for the standard error and confidence intervals of r̂corrected, which could be adapted for hypothesis testing.
Allocation of research effort: number of trials vs. number of individuals
The unbiased estimator for the correlation coefficient (eqns 3 and 4) requires at least two measurements per individual so that $s^2_{wX}$ and $s^2_{wY}$ can be estimated. Clearly, measuring more than two values per individual has several benefits: it decreases the bias (see eqn 3), thereby increasing statistical power, and it also should give a better estimate of the variance components $\sigma^2_a$ and $\sigma^2_w$. However, for a fixed total sample size (Nsubjects × ntrials) there is a trade-off between Nsubjects and ntrials. This trade-off would be particularly important to consider when nontrivial effort is required for each phenotypic measurement (e.g. in studies of exercise metabolism).
We ran simulations to explore how this trade-off between Nsubjects and ntrials influences the precision of estimating ρ. We assumed that a researcher could make 160 total measurements, and chose combinations of Nsubjects (from 4 to 80) and ntrials (from 2 to 40) that would yield this total. We then obtained 5000 random samples for each allocation and for each possible combination of repeatabilities (ri = 0·2, 0·5, 0·8) and true correlations (ρ = 0·4, 0·8), using the same repeatabilities and sample sizes for Traits X and Y.
In the intermediate- and high-repeatability cases, the sampling variance of the correlation coefficient decreased with higher values of Nsubjects and lower values of ntrials (Fig. 4). This decrease in sampling variance was most consistent in the high-repeatability case. In the low-repeatability simulation, the sampling variance was lowest at intermediate values of Nsubjects and ntrials. These findings were essentially the same for high (ρ = 0·8) and low (ρ = 0·4) true correlations. These results suggest that the optimal allocation of sampling effort, using the criterion of minimizing the sampling variance of the correlation coefficient (i.e. increasing the precision of the estimate), depends on repeatability. For traits with high repeatability, the best design appears to involve making two measurements on as many individuals as possible. For traits with intermediate repeatabilities, sampling variance was relatively insensitive to allocation as long as Nsubjects > ntrials. For traits with low repeatabilities, choosing a higher ntrials at the expense of a lower Nsubjects appears to yield slightly more precise estimates of ρ. When repeatabilities are unknown prior to beginning an experiment, Fig. 4 suggests that choosing a small value of ntrials (i.e. 2, 3 or 4), along with the highest feasible value of Nsubjects, would probably yield the most precise estimate of ρ.
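The allocation trade-off can be explored with a sketch such as the one below, which compares the sampling standard deviation of the corrected estimate across allocations of 160 total measurements. It is a simplified stand-in for the simulations behind Fig. 4, not a reproduction of them: the correction uses the known (parametric) variance components, the error in each individual mean is drawn directly, corrected values are set to ±1 when they fall outside that range, and far fewer replicates are used. Function names and settings are assumptions.

```python
import math
import random

def pearson(xs, ys):
    """Conventional Pearson product-moment correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def sd_of_corrected_r(rho, repeatability, n_subjects, n_trials,
                      n_reps=300, seed=1):
    """Sampling SD of the corrected correlation for one allocation."""
    rng = random.Random(seed)
    var_within = (1.0 - repeatability) / repeatability
    sd_mean_error = math.sqrt(var_within / n_trials)
    factor = 1.0 + var_within / n_trials  # eqn 3, same for both traits
    estimates = []
    for _ in range(n_reps):
        xs, ys = [], []
        for _ in range(n_subjects):
            z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
            xs.append(z1 + rng.gauss(0.0, sd_mean_error))
            ys.append(rho * z1 + math.sqrt(1.0 - rho ** 2) * z2
                      + rng.gauss(0.0, sd_mean_error))
        r_corr = pearson(xs, ys) * factor
        estimates.append(max(-1.0, min(1.0, r_corr)))
    m = sum(estimates) / n_reps
    return math.sqrt(sum((e - m) ** 2 for e in estimates) / (n_reps - 1))

TOTAL = 160  # total number of measurements per trait
for n_trials in (2, 4, 8, 20, 40):
    n_subjects = TOTAL // n_trials
    sd = sd_of_corrected_r(0.8, 0.8, n_subjects, n_trials)
    print(n_subjects, n_trials, round(sd, 3))
```

Changing the repeatability argument from 0·8 to 0·5 or 0·2 gives a rough view of how the preferred allocation shifts as repeatability falls.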
The trade-off between ntrials and Nsubjects has implications for statistical power as well as for the precision of the estimate. Increasing either ntrials or Nsubjects will increase statistical power for detecting nonzero correlation coefficients, but for different reasons. Using a larger Nsubjects increases the degrees of freedom for the hypothesis test, and also decreases the sampling variance of the estimated correlation coefficient (see Fig. 3). Using a larger ntrials decreases bias (eqn 2), which could be a substantial benefit when repeatability is low. Given a limited total sample size, increasing Nsubjects is probably a more effective way to increase power than increasing ntrials.
Correlations involving maximum rather than mean values
In many studies of organismal performance, such as burst speed (Bennett 1980; Wilson 2005), jumping distance (Watkins 1997), and bite force (Huyghe et al. 2005), researchers use maximum rather than mean values because maxima are likely to be more relevant measures of performance in ecological and evolutionary contexts. Like mean values, maximum values are estimated with error due to intraindividual variability; individuals do not always perform at their physiological maximum. Therefore, when a correlation is calculated between individual maximum values for two traits, it will be closer to zero on average than the true correlation.
Unfortunately, the unbiased estimator presented in eqn 3 is not appropriate for correlations between maximum values, because the sampling distribution of a maximum is different from the sampling distribution of a mean value. Sampling distributions of maximum values are more complicated than those of mean values (Gumbel 1958); for example, they depend on the underlying distribution of individual performances, whereas the distribution of sample means is approximately normal for a variety of underlying distributions (a consequence of the Central Limit Theorem). Also, the expected value of a maximum increases with ntrials (Losos, Creer & Schulte 2002; S.C. Adolph & T. Pickering, unpublished MS), whereas the expected value of a mean is independent of ntrials. Because of these complications, new statistical procedures need to be devised to correct for bias in correlations between maximum performances. We suggest that empirical studies report correlations between performance traits using individual mean values in addition to correlations using individual maxima, particularly when the goal is to identify physiological and morphological traits that affect whole-organism performance.
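The dependence of a maximum on the number of trials can be seen in a simple numerical sketch. The Python code below assumes, purely for illustration, that an individual's performances are independent standard normal draws; it then compares the average of the per-individual mean with the average of the per-individual maximum for several values of ntrials. The function name and settings are hypothetical.

```python
import random

def mean_and_max_expectations(n_trials, n_reps=20000, seed=1):
    """Monte Carlo averages of the sample mean and the sample maximum."""
    rng = random.Random(seed)
    total_mean = total_max = 0.0
    for _ in range(n_reps):
        # One simulated individual: n_trials performances with mean 0, SD 1.
        draws = [rng.gauss(0.0, 1.0) for _ in range(n_trials)]
        total_mean += sum(draws) / n_trials
        total_max += max(draws)
    return total_mean / n_reps, total_max / n_reps

for n_trials in (2, 5, 10):
    avg_mean, avg_max = mean_and_max_expectations(n_trials)
    print(n_trials, round(avg_mean, 2), round(avg_max, 2))
```

Under this assumed distribution the average mean stays near zero for every ntrials, whereas the average maximum rises steadily, so correlations between maxima measured with different numbers of trials are not directly comparable.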
We thank J.-F. Le Galliard and three anonymous reviewers for helpful comments on the manuscript, T. Pickering and H. Groves for their laboratory and simulation work that led to this project, T. Garland and V. Malcarne for suggesting key references, and R. Carroll for thoughtful discussions about correlation bias. S.C.A. thanks Monica Geber and the Department of Ecology and Evolutionary Biology at Cornell University for providing space and assistance during his sabbatical. Funding for this research was provided by a grant from the W.M. Keck Foundation to the Center for Quantitative Life Sciences at Harvey Mudd College.