Estimating phenotypic correlations: correcting for bias due to intraindividual variability


  • S. C. ADOLPH,

    Corresponding author
    1. Department of Biology, Harvey Mudd College, 301 Platt Blvd., Claremont, CA 91711 USA,
      †Author to whom correspondence should be addressed. E-mail:
  • J. S. HARDIN

    1. Department of Mathematics, Pomona College, 610 N. College Ave., Claremont, CA 91711 USA



  1. Correlations between phenotypic traits are important in a number of contexts in physiological ecology, evolutionary physiology, and behaviour. Correlations can reflect functional connections or trade-offs among performance traits (e.g. bite force, jumping distance) and can reveal causal relationships between whole-organism traits and lower-level biochemical or morphological traits.
  2. However, when one or both traits exhibit intraindividual variability (i.e. repeatability < 1), conventional estimates of Pearson product-moment correlation coefficients are biased towards zero (= attenuated). The magnitude of this bias decreases with increases in the number of measurements used to calculate the mean value of the trait for each individual. The bias varies inversely with the repeatability of each trait.
  3. We present an estimator for the correlation coefficient that eliminates this bias. This estimator is based on an equation originally presented in 1904 by Spearman, and applied by researchers in psychological testing and nutritional epidemiology. The estimator is a simple function of the within- and among-individual components of variance for each of the two traits.
  4. Simulations show that optimal sampling effort usually involves a small number of trials per individual and a large sample of individuals (for a fixed total sample size), although correlations between traits with low repeatabilities may be more precisely estimated with a larger number of trials per individual and a smaller number of individuals.
  5. In addition to reducing the accuracy of the estimate, attenuation also reduces statistical power for detecting significant correlations. However, we do not recommend using the unbiased estimator for testing whether correlations differ from zero, because this inflates Type I error rates. Instead, the uncorrected (conventional) estimator should be used for hypothesis testing.
  6. The unbiased estimator is not appropriate for correlations involving maximum or minimum values for each individual (e.g. maximum sprint speed) because sampling distributions of these extreme values typically have different properties than the sampling distributions of individual mean values.


Correlations between phenotypic traits are important in diverse contexts in animal behaviour, evolution and integrative biology. For example, phenotypic correlations are used to assess functional relationships between biochemical, morphological and whole-organism traits (Garland 1984; Bennett, Garland & Else 1989; Steyermark et al. 2005), and to identify behavioural syndromes (Huntingford 1976; Sih, Bell & Johnson 2004a; Sih et al. 2004b).

Here, we alert researchers to a statistical problem that is not usually recognized in integrative biology: conventional estimates of Pearson product-moment correlations are biased towards zero whenever there is intraindividual variation in one or both of the traits involved (Fuller 1987). Virtually all behavioural and physiological traits show some degree of intraindividual variability (i.e. repeatability < 1·0, e.g. Bennett 1987; Huey & Dunham 1987; Boake 1989; Hayes & Jenkins 1997; Brodie & Russell 1999; Dohm 2002). Therefore, the values of phenotypic correlations reported in the biological literature are likely to underestimate the true correlations, on average.

We present an estimator that gives a theoretically unbiased estimate of the correlation coefficient between individual trait means. In addition, we use simulations to evaluate the performance of corrected vs. uncorrected estimates of correlation coefficients and to explore the allocation of sampling effort between the number of individuals vs. the number of measurements per individual.

Bias in estimating correlation coefficients

Intraindividual variability is statistically equivalent to measurement error. Statisticians have long recognized that estimates of Pearson product-moment correlation coefficients are biased when there is measurement error in either the X or the Y variable (Spearman 1904; Thouless 1939; Fuller 1987). This bias is called ‘attenuation’ (Spearman 1904) because on average an estimate of a correlation coefficient is biased towards zero. Attenuation of correlation coefficients is well-known in some fields, particularly in psychological and educational testing (Gulliksen 1950) and nutritional epidemiology (Willett 1998). However, researchers rarely acknowledge or correct for attenuation in many fields of biology, including physiological ecology, evolutionary physiology, and animal behaviour. A related phenomenon, the attenuation of regression slopes due to measurement error in the x variable, has been addressed by several papers in these fields (e.g. McArdle 2003).

Spearman (1904) showed that the expected value of a sample correlation coefficient between two traits X and Y is given by

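$$E(r) = \rho \sqrt{\frac{\sigma^2_{a,X}}{\sigma^2_{a,X} + \sigma^2_{w,X}} \cdot \frac{\sigma^2_{a,Y}}{\sigma^2_{a,Y} + \sigma^2_{w,Y}}} \qquad \text{(eqn 1)}$$

where ρ = the true correlation coefficient between the mean values of Traits X and Y for each individual, $\sigma^2_{a,X}$ = the among-individual variance in Trait X and $\sigma^2_{w,X}$ = the within-individual variance in Trait X, with corresponding notation for Trait Y. Now suppose that we make $n_X$ measurements per individual for Trait X, and $n_Y$ measurements for Trait Y, and use the mean value of each individual's trials as our x,y pairs. The individual mean values have error variances $\sigma^2_{w,X}/n_X$ and $\sigma^2_{w,Y}/n_Y$, so that the expected value of a sample correlation coefficient is

$$E(r) = \rho \sqrt{\frac{\sigma^2_{a,X}}{\sigma^2_{a,X} + \sigma^2_{w,X}/n_X} \cdot \frac{\sigma^2_{a,Y}}{\sigma^2_{a,Y} + \sigma^2_{w,Y}/n_Y}} \qquad \text{(eqn 2)}$$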

This formula or its equivalent has been presented previously by several authors in other fields (Liu et al. 1978; Beaton et al. 1979; Rosner & Willett 1988). Note that the magnitude of the multiplicative bias (the square root term) is independent of ρ and of the number of individuals sampled.

Equations 1 and 2 show that if within-individual variance is present the estimated correlation coefficient will be biased towards zero. The magnitude of the bias decreases as the ratio $\sigma^2_w/\sigma^2_a$ decreases (i.e. as repeatability increases) and as the per-individual sample sizes $n_X$ and $n_Y$ increase (Fig. 1).

Figure 1.

Multiplicative bias of the estimated correlation coefficient between two traits when each trait exhibits intraindividual variation. Each curve shows the expected value of the correlation coefficient (eqn 2) that would be obtained from a sample, shown as a percentage of its true value, when the score for each individual is the mean of a given number of trials (= measurements). Curves are shown for four representative values of repeatability (intraclass correlation coefficient, ri); for these examples, the repeatability is the same for traits X and Y, and the number of trials per individual are the same for both traits (nX = nY).
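As a rough numerical illustration of eqn 2, the following Python sketch computes the multiplicative bias from the repeatabilities and the number of trials, using the identity that the ratio of within- to among-individual variance equals (1 − ri)/ri. The function name `attenuation_factor` and the repeatability values looped over are our own illustrative choices, not necessarily those plotted in Fig. 1. For example, with ri = 0·5 for both traits and two trials per individual, the expected sample correlation is two-thirds of ρ.

```python
import math


def attenuation_factor(rep_x, rep_y, n_x=1, n_y=1):
    """Multiplicative bias of eqn 2 (the square-root term).

    rep_x, rep_y: repeatabilities (intraclass correlations), 0 < rep <= 1.
    n_x, n_y: number of trials averaged per individual for each trait.
    """
    # Within/among variance ratio implied by each repeatability.
    ratio_x = (1.0 - rep_x) / rep_x
    ratio_y = (1.0 - rep_y) / rep_y
    return 1.0 / math.sqrt((1.0 + ratio_x / n_x) * (1.0 + ratio_y / n_y))


# Expected r as a percentage of rho, for illustrative repeatabilities.
for rep in (0.2, 0.5, 0.8):
    row = [round(100 * attenuation_factor(rep, rep, n, n), 1) for n in (1, 2, 5, 10)]
    print(f"repeatability {rep}: {row}")
```

The factor depends only on the repeatabilities and the numbers of trials, which is why neither ρ nor the number of individuals appears among the arguments.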

Equation 2 can be used to derive an unbiased estimator for the correlation coefficient:

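$$\hat{r}_{\text{corrected}} = r \sqrt{\left(1 + \frac{s^2_{w,X}}{n_X\, s^2_{a,X}}\right)\left(1 + \frac{s^2_{w,Y}}{n_Y\, s^2_{a,Y}}\right)} \qquad \text{(eqn 3)}$$

where r is the uncorrected correlation coefficient calculated using the sample mean values of each trait for each individual ($\bar{x}$ and $\bar{y}$). We have replaced the parametric variance terms ($\sigma^2$) with sample variance terms ($s^2$) in eqn 3 because these variances will almost always be estimated from data. It is possible for a given sample to yield a value of $\hat{r}_{\text{corrected}}$ that exceeds 1·0 (or −1·0); in this case, the estimate should be rounded to 1·0 (or −1·0).

Because the repeatability (= intraclass correlation coefficient; Sokal & Rohlf 1994) for Trait X is given by $r_{i,X} = \sigma^2_{a,X}/(\sigma^2_{a,X} + \sigma^2_{w,X})$, eqn 3 can be rewritten in terms of the repeatabilities of the two traits:

$$\hat{r}_{\text{corrected}} = r \sqrt{\left(1 + \frac{1 - r_{i,X}}{n_X\, r_{i,X}}\right)\left(1 + \frac{1 - r_{i,Y}}{n_Y\, r_{i,Y}}\right)} \qquad \text{(eqn 4)}$$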

Equation 4 can be used to obtain unbiased estimates of correlations from previously published work if the repeatabilities are known, and also demonstrates that using the mean of several measurements is comparable with increasing repeatability (Falconer 1981; Arnold 1994; Hayes & Jenkins 1997).
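The following Python sketch shows one way eqns 3 and 4 might be applied to data. It assumes a balanced design (the same number of trials for every individual) and estimates the variance components from one-way ANOVA mean squares; that choice, and all function names, are our assumptions for illustration rather than a prescribed procedure. Estimates beyond ±1·0 are rounded to ±1·0 as described above.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def variance_components(trials):
    """Return (s2_among, s2_within) for a balanced design.

    trials: one list of measurements per individual, all of equal length >= 2.
    Assumption: components are derived from one-way ANOVA mean squares.
    """
    n_ind, n = len(trials), len(trials[0])
    means = [sum(t) / n for t in trials]
    grand = sum(means) / n_ind
    ss_within = sum((v - m) ** 2 for t, m in zip(trials, means) for v in t)
    ms_within = ss_within / (n_ind * (n - 1))
    ms_among = n * sum((m - grand) ** 2 for m in means) / (n_ind - 1)
    return (ms_among - ms_within) / n, ms_within


def clamp(r):
    """Round estimates beyond +/-1 back to +/-1."""
    return max(-1.0, min(1.0, r))


def corrected_r(x_trials, y_trials):
    """Eqn 3: correct the correlation of individual means for attenuation."""
    n_x, n_y = len(x_trials[0]), len(y_trials[0])
    r = pearson([sum(t) / n_x for t in x_trials],
                [sum(t) / n_y for t in y_trials])
    s2a_x, s2w_x = variance_components(x_trials)
    s2a_y, s2w_y = variance_components(y_trials)
    if s2a_x <= 0 or s2a_y <= 0:
        # The correction is undefined without among-individual variance.
        raise ValueError("non-positive among-individual variance estimate")
    factor = math.sqrt((1 + s2w_x / (n_x * s2a_x)) * (1 + s2w_y / (n_y * s2a_y)))
    return clamp(r * factor)


def corrected_r_from_repeatability(r, rep_x, rep_y, n_x, n_y):
    """Eqn 4: the same correction when only repeatabilities are known."""
    factor = math.sqrt((1 + (1 - rep_x) / (n_x * rep_x))
                       * (1 + (1 - rep_y) / (n_y * rep_y)))
    return clamp(r * factor)


# A published r = 0.2 based on means of two trials, repeatabilities of 0.5.
print(corrected_r_from_repeatability(0.2, 0.5, 0.5, 2, 2))
```

The sketch raises an error when an estimated among-individual variance is zero or negative, a situation that can arise in small samples and that would need a considered decision in real analyses.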

We ran computer simulations to evaluate the performance of the unbiased estimator (eqn 3). We drew samples from a bivariate normal distribution with a known true correlation coefficient (ρ = 0·4 and 0·8) and examined the effects of number of individuals (Nsubjects = 20, 40 and 80), the number of measurements per individual (ntrials = 2, 5 and 10), and repeatability (ri = 0·2, 0·5 and 0·8). We used the same values of ntrials and ri for Traits X and Y. We added to each variate a random, normally distributed error term adjusted to yield the desired repeatability. For each parameter combination we drew 5000 independent samples, and from each sample we calculated four different estimates of the correlation coefficient.

  1. Uncorrected for bias, but without adding error terms (i.e. $\sigma^2_{w,X} = \sigma^2_{w,Y} = 0$), as a check on the simulation procedure. This should yield an unbiased estimate, on average.
  2. Uncorrected for bias. On average, this should yield a biased estimate (eqn 2).
  3. Corrected for bias using eqn 3, using the known parametric variance components. This should yield an unbiased estimate.
  4. Corrected for bias using eqn 3, with variance components estimated from the sample. This should yield an unbiased estimate.
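The simulation procedure can be sketched in a few lines of Python. This is a simplified re-creation under stated assumptions, not the R code used for the results reported here: among-individual variance is fixed at 1, so the within-individual variance is (1 − ri)/ri; the same repeatability and number of trials apply to both traits; and only the first three estimators are computed (the fourth would additionally require estimating variance components from each sample). The function names and the seed are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate(rho, rep, n_subjects, n_trials, reps, seed=1):
    """Return mean estimates: (no error, uncorrected, corrected with known variances)."""
    rng = random.Random(seed)
    s2_within = (1.0 - rep) / rep           # among-individual variance is 1
    sd_within = math.sqrt(s2_within)
    factor = 1.0 + s2_within / n_trials      # eqn 3 when X and Y share rep and n
    sums = [0.0, 0.0, 0.0]
    for _ in range(reps):
        true_x, true_y, mean_x, mean_y = [], [], [], []
        for _subject in range(n_subjects):
            # True individual values from a bivariate normal with correlation rho.
            z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
            tx = z1
            ty = rho * z1 + math.sqrt(1 - rho ** 2) * z2
            # Observed score: mean of n_trials noisy measurements.
            ex = sum(rng.gauss(0, sd_within) for _t in range(n_trials)) / n_trials
            ey = sum(rng.gauss(0, sd_within) for _t in range(n_trials)) / n_trials
            true_x.append(tx)
            true_y.append(ty)
            mean_x.append(tx + ex)
            mean_y.append(ty + ey)
        r = pearson(mean_x, mean_y)
        sums[0] += pearson(true_x, true_y)
        sums[1] += r
        sums[2] += max(-1.0, min(1.0, r * factor))
    return tuple(s / reps for s in sums)


no_error, uncorrected, corrected = simulate(0.4, 0.5, 40, 2, reps=2000)
print(f"no error: {no_error:.3f}  uncorrected: {uncorrected:.3f}  corrected: {corrected:.3f}")
```

With ρ = 0·4, ri = 0·5 and two trials per individual, eqn 2 predicts a mean uncorrected correlation near 0·27, which the sketch should reproduce approximately.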

Simulations were run in the statistical computing language R (R Development Core Team 2005) and yielded several noteworthy results. First, the uncorrected (conventional) correlation coefficients were biased towards zero as predicted by theory, whereas the corrected correlation coefficients on average were unbiased (Figs 2 and 3). In each case the mean correlations were essentially identical to the analytical predictions from eqns 2 and 3. Second, the sampling distributions differed in variance (Fig. 2). The variance of the distribution with $\sigma^2_w = 0$ was entirely due to sampling of individuals from the population. The other three distributions exhibited variance due to both sampling of individuals and sampling of values within individuals. The greater sampling variance of the unbiased estimators compared with the conventional estimator is due to the fact that the unbiased estimate involves multiplication by a scalar (the correction factor); recall that $\mathrm{Var}(aX) = a^2\,\mathrm{Var}(X)$ for any random variable X and scalar a. The distributions of the unbiased estimators were virtually identical, indicating that estimating the variance components in eqn 3 added little variance to the total sampling variance.

Figure 2.

Sampling distributions of Pearson product-moment correlation coefficients obtained via simulation. Ten thousand independent samples of N = 40 observations were drawn from a bivariate normal distribution with ρ = 0·4 (indicated by the vertical line) and repeatabilities ri,X = ri,Y = 0·5. Correlations were calculated using the mean values of two trials per individual for each trait (ntrials,X = ntrials,Y = 2). Dashed line: conventional correlation coefficient, uncorrected for bias. Dotted line: unbiased correlation coefficient (eqn 4), calculated using the known parametric values of the variance components. Solid line: unbiased correlation coefficient (eqn 4), with variance components estimated from the sample. Dashes and dots: correlation coefficients calculated for observations with zero intraindividual variability (ri,X = ri,Y= 1·0); the variance of this distribution reflects only the sampling variation due to the finite number of individuals.

Figure 3.

Sampling distributions of Pearson product-moment correlation coefficients obtained via computer simulation, for selected sample sizes and repeatabilities. Each panel shows the median (heavy line) and central 90% of sample correlation coefficients from 5000 independent samples, using three different estimators: dark grey is the conventional estimator; light grey is the unbiased estimator with variance components estimated from the sample, and white is the unbiased estimator using known variance components. Each panel includes results for three different values of repeatability (ri,X = ri,Y = 0·2, 0·5 or 0·8). The horizontal line in each panel indicates the true value of ρ (either 0·4 or 0·8). Three different sample sizes of individuals are used (Nsubjects = 20, 40 or 80), and three different numbers of trials for each individual (ntrials = 2, 5 or 10).

Overall, the simulations confirmed that the uncorrected estimator is biased, and that the corrected estimator eliminated this bias (Fig. 3). The simulations also illustrated the effects of sample size and repeatability on bias: the bias was greatest for the smallest number of samples per individual (ntrials = 2) and for the lowest repeatabilities (ri = 0·2). In these cases, almost the entire distribution of sample correlation coefficients fell below the true value of ρ, clearly an undesirable property for an estimator. Another key result is that using a larger sample size of individuals (Nsubjects) did not reduce bias, although it did reduce the sampling variances of both the unbiased and biased estimators; recall that Nsubjects does not appear in eqn 2. The sampling variance of the unbiased estimator was greater than that of the biased (uncorrected) estimator, as discussed above for Fig. 2. However, this difference decreased with increases in repeatability and the number of trials per individual. Our results are consistent with the conclusions of Fan (2003), who conducted a similar study to address the issue of measurement reliability and its effect on correlations between test scores in educational and psychological assessment.

Hypothesis testing: which estimator to use?

At first glance it might seem preferable to use the more accurate unbiased estimator (eqn 3) for testing the null hypothesis ρ = 0. However, it is actually more appropriate to use the uncorrected (conventional) estimate for this hypothesis test, for the following reason: if ρ = 0, then there is no bias, and therefore no need to correct the estimate. The tables of critical values for the correlation coefficient were established for the uncorrected estimator; different tables would need to be constructed for the unbiased estimator. Using an unbiased estimate of ρ for hypothesis testing with the conventional critical values would inflate the Type I error rate beyond the desired rate, because the estimate would exceed the critical value more frequently. We verified this effect by running simulations in which ρ = 0 (results not shown). This suggests that researchers should first use the conventional estimator to test whether ρ = 0. Then, if H0 is rejected (i.e. ρ ≠ 0), eqn 3 should be used for an unbiased estimate of ρ.
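The inflation of Type I error can be illustrated with a small Python sketch, which is our own re-creation rather than the simulation referred to above. It assumes ρ = 0, 40 individuals, ri = 0·5 and two trials per trait, and compares both estimators against the conventional two-tailed 5% critical value for 38 degrees of freedom (about 0·312, derived from the tabled t value of 2·024). The function name and seed are hypothetical, and the correction uses known variance components.

```python
import math
import random


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def type_one_rates(rep, n_subjects, n_trials, r_crit, reps, seed=1):
    """Rejection rates under rho = 0 for (uncorrected, corrected) estimates."""
    rng = random.Random(seed)
    s2_within = (1.0 - rep) / rep                 # among-individual variance is 1
    # The mean of n_trials errors has variance s2_within / n_trials,
    # so the error of each individual mean is drawn directly.
    sd_mean_error = math.sqrt(s2_within / n_trials)
    factor = 1.0 + s2_within / n_trials           # eqn 3 with known variances
    reject_uncorrected = reject_corrected = 0
    for _ in range(reps):
        x = [rng.gauss(0, 1) + rng.gauss(0, sd_mean_error) for _s in range(n_subjects)]
        y = [rng.gauss(0, 1) + rng.gauss(0, sd_mean_error) for _s in range(n_subjects)]
        r = pearson(x, y)
        reject_uncorrected += abs(r) > r_crit
        reject_corrected += abs(r * factor) > r_crit
    return reject_uncorrected / reps, reject_corrected / reps


# Critical r for alpha = 0.05 (two-tailed), df = 38: t / sqrt(t^2 + df), t = 2.024.
R_CRIT = 2.024 / math.sqrt(2.024 ** 2 + 38)
print(type_one_rates(0.5, 40, 2, R_CRIT, reps=4000))
```

Under these assumptions the uncorrected rejection rate should stay near the nominal 5%, whereas the corrected estimate crosses the same critical value far more often.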

Much less frequently, researchers wish to test null hypotheses involving nonzero values of ρ. In this case it seems reasonable to use the unbiased estimator. Rosner & Willett (1988) and Charles (2005) give formulas for the standard error and confidence intervals of $\hat{r}_{\text{corrected}}$, which could be adapted for hypothesis testing.

Allocation of research effort: number of trials vs. number of individuals

The unbiased estimator for the correlation coefficient (eqns 3 and 4) requires at least two measurements per individual so that $\sigma^2_{w,X}$ and $\sigma^2_{w,Y}$ can be estimated. Clearly, measuring more than two values per individual has several benefits: it decreases the bias (see eqn 3), thereby increasing statistical power, and it also should give a better estimate of the variance components $\sigma^2_{w,X}$ and $\sigma^2_{w,Y}$. However, for a fixed total sample size (Nsubjects × ntrials) there is a trade-off between Nsubjects and ntrials. This trade-off would be particularly important to consider when nontrivial effort is required for each phenotypic measurement (e.g. in studies of exercise metabolism).

We ran simulations to explore how this trade-off between Nsubjects and ntrials influences the precision of estimating ρ. We assumed that a researcher could make 160 total measurements, and chose combinations of Nsubjects (from 4 to 80) and ntrials (from 2 to 40) that would yield this total. We then obtained 5000 random samples for each allocation and for each possible combination of repeatabilities (ri = 0·2, 0·5, 0·8) and true correlations (ρ = 0·4, 0·8), using the same repeatabilities and sample sizes for Traits X and Y.

In the intermediate- and high-repeatability cases, the sampling variance of the correlation coefficient decreased with higher values of Nsubjects and lower values of ntrials (Fig. 4). This decrease in sampling variance was most consistent in the high-repeatability case. In the low-repeatability simulation, the sampling variance was lowest at intermediate values of Nsubjects and ntrials. These findings were essentially the same for high (ρ = 0·8) and low (ρ = 0·4) true correlations. These results suggest that the optimal allocation of sampling effort, using the criterion of minimizing the sampling variance of the correlation coefficient (i.e. increasing the precision of the estimate), depends on repeatability. For traits with high repeatability, the best design appears to involve making two measurements on as many individuals as possible. For traits with intermediate repeatabilities, sampling variance was relatively insensitive to allocation as long as Nsubjects > ntrials. For traits with low repeatabilities, choosing a higher ntrials at the expense of a lower Nsubjects appears to yield slightly more precise estimates of ρ. When repeatabilities are unknown prior to beginning an experiment, Fig. 4 suggests that choosing a small value of ntrials (i.e. 2, 3 or 4), along with the highest feasible value of Nsubjects, would probably yield the most precise estimate of ρ.

Figure 4.

Effect of trading off the number of trials per individual (ntrials) vs. number of individuals (Nsubjects) on the sampling variance of unbiased correlation coefficients obtained via simulation. Each bar shows the median and central 90% of sample correlation coefficients (eqn 2) from 5000 samples drawn from a bivariate normal distribution. The total number of observations (ntrials × Nsubjects) was fixed at 160; the combination of ntrials and Nsubjects corresponding to each bar is indicated. Top three panels: ρ = 0·4; bottom three panels: ρ = 0·8; horizontal line indicates true value of ρ.
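The allocation trade-off can be explored with a Python sketch along the following lines. It is a simplified stand-in for the simulations summarized in Fig. 4: the correction uses known variance components (with among-individual variance fixed at 1) rather than components estimated from each sample, the list of allocations is merely one set of factor pairs of 160, and the function name and seed are hypothetical. Results from this simplification need not match Fig. 4 in detail, particularly at low repeatability.

```python
import math
import random


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def interval_width(rho, rep, n_subjects, n_trials, reps, seed=1):
    """Width of the central 90% of corrected estimates (known variances)."""
    rng = random.Random(seed)
    s2_within = (1.0 - rep) / rep
    sd_mean_error = math.sqrt(s2_within / n_trials)  # error of an individual mean
    factor = 1.0 + s2_within / n_trials              # eqn 3, same rep and n for X and Y
    estimates = []
    for _ in range(reps):
        x, y = [], []
        for _s in range(n_subjects):
            z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
            x.append(z1 + rng.gauss(0, sd_mean_error))
            y.append(rho * z1 + math.sqrt(1 - rho ** 2) * z2
                     + rng.gauss(0, sd_mean_error))
        estimates.append(max(-1.0, min(1.0, pearson(x, y) * factor)))
    estimates.sort()
    return estimates[int(0.95 * reps) - 1] - estimates[int(0.05 * reps)]


# Illustrative allocations of 160 total measurements per trait.
for n_subjects, n_trials in [(80, 2), (40, 4), (16, 10), (8, 20), (4, 40)]:
    width = interval_width(0.4, 0.8, n_subjects, n_trials, reps=1000)
    print(f"Nsubjects={n_subjects:2d} ntrials={n_trials:2d} 90% width={width:.2f}")
```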

The trade-off between ntrials and Nsubjects has implications for statistical power as well as for the precision of the estimate. Increasing either ntrials or Nsubjects will increase statistical power for detecting nonzero correlation coefficients, but for different reasons. Using a larger Nsubjects increases the degrees of freedom for the hypothesis test, and also decreases the sampling variance of the estimated correlation coefficient (see Fig. 3). Using a larger ntrials decreases bias (eqn 2), which could be a substantial benefit when repeatability is low. Given a limited total sample size, increasing Nsubjects is probably a more effective way to increase power than increasing ntrials.

Correlations involving maximum rather than mean values

In many studies of organismal performance, such as burst speed (Bennett 1980; Wilson 2005), jumping distance (Watkins 1997), and bite force (Huyghe et al. 2005), researchers use maximum rather than mean values because maxima are likely to be more relevant measures of performance in ecological and evolutionary contexts. Like mean values, maximum values are estimated with error due to intraindividual variability; individuals do not always perform at their physiological maximum. Therefore, when a correlation is calculated between individual maximum values for two traits, it will be closer to zero on average than the true correlation.

Unfortunately, the unbiased estimator presented in eqn 3 is not appropriate for correlations between maximum values, because the sampling distribution of a maximum differs from that of a mean value. Sampling distributions of maximum values are more complicated than those of mean values (Gumbel 1958); for example, they depend on the underlying distribution of individual performances, whereas the distribution of sample means is approximately normal for a variety of underlying distributions (a consequence of the Central Limit Theorem). Also, the expected value of a maximum increases with ntrials (Losos, Creer & Schulte 2002; S.C. Adolph & T. Pickering, unpublished MS), whereas the expected value of a mean is independent of ntrials. Because of these complications, new statistical procedures need to be devised to correct for bias in correlations between maximum performances. We suggest that empirical studies report correlations between performance traits using individual mean values in addition to correlations using individual maxima, particularly when the goal is to identify physiological and morphological traits that affect whole-organism performance.
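The dependence of a maximum on the number of trials is easy to see in a short Python sketch. It assumes, purely for illustration, that an individual's trial-to-trial performances are standard normal deviates around its own average; real performance distributions may differ, and the function name and seed are hypothetical.

```python
import random


def mean_and_max(n_trials, reps=5000, seed=1):
    """Average sample mean and average sample maximum over many simulated individuals."""
    rng = random.Random(seed)
    total_mean = 0.0
    total_max = 0.0
    for _ in range(reps):
        # One individual's trials, as deviations from its own true average.
        trials = [rng.gauss(0.0, 1.0) for _t in range(n_trials)]
        total_mean += sum(trials) / n_trials
        total_max += max(trials)
    return total_mean / reps, total_max / reps


for n_trials in (2, 5, 10):
    avg_mean, avg_max = mean_and_max(n_trials)
    print(f"ntrials={n_trials:2d}  mean of means={avg_mean:+.2f}  mean of maxima={avg_max:+.2f}")
```

Under this assumption the average of the means stays near zero for any number of trials, whereas the average maximum rises steadily as trials are added.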


We thank J.-F. Le Galliard and three anonymous reviewers for helpful comments on the manuscript, T. Pickering and H. Groves for their laboratory and simulation work that led to this project, T. Garland and V. Malcarne for suggesting key references, and R. Carroll for thoughtful discussions about correlation bias. S.C.A. thanks Monica Geber and the Department of Ecology and Evolutionary Biology at Cornell University for providing space and assistance during his sabbatical. Funding for this research was provided by a grant from the W.M. Keck Foundation to the Center for Quantitative Life Sciences at Harvey Mudd College.