Guidelines for estimating repeatability


  • Matthew E. Wolak (Correspondence author)
    1. Graduate Program in Evolution, Ecology, and Organismal Biology, University of California, Riverside, CA 92521, USA
    2. Department of Biology, University of California, Riverside, CA 92521, USA
  • Daphne J. Fairbairn
    1. Graduate Program in Evolution, Ecology, and Organismal Biology, University of California, Riverside, CA 92521, USA
    2. Department of Biology, University of California, Riverside, CA 92521, USA
  • Yale R. Paulsen
    1. Graduate Program in Evolution, Ecology, and Organismal Biology, University of California, Riverside, CA 92521, USA
    2. Department of Biology, University of California, Riverside, CA 92521, USA


1. Researchers frequently take repeated measurements of individuals in a sample with the goal of quantifying the proportion of the total variation that can be attributed to variation among individuals vs. variation among measurements within individuals. The proportion of the variation attributed to variation among individuals is known as repeatability and is most frequently estimated as the intraclass correlation coefficient (ICC). The goal of our study is to provide guidelines for determining the sample size (number of individuals and number of measurements per individual) required to accurately estimate the ICC.

2. We report a range of ICCs from the literature and estimate 95% confidence intervals for these estimates. We introduce a predictive equation derived by Bonett (2002), and we test the assumptions of this equation through simulation. Finally, we create an R statistical package for the planning of experiments and estimation of ICCs.

3. Repeatability estimates were reported in 1·5% of the articles published in the journals surveyed. Repeatabilities tended to be highest when the ICC was used to estimate measurement error and lowest when it was used to estimate repeatability of behavioural and physiological traits. Few authors report confidence intervals, but our estimated 95% confidence intervals for published ICCs generally indicated a low level of precision associated with these estimates. This survey demonstrates the need for a protocol to estimate repeatability.

4. Analysis of the predictions from Bonett’s equation over a range of sample sizes, expected repeatabilities and desired confidence interval widths yields both analytical and intuitive guidelines for designing experiments to estimate repeatability. However, we find a tendency for the confidence interval to be underestimated by the equation when ICCs are high and overestimated when ICCs and the number of measurements per individual are low.

5. The sample size to use when estimating repeatability is a question pitting investigator effort against expected precision of the estimate. We offer guidelines that apply over a wide variety of ecological and evolutionary studies estimating repeatability, measurement error or heritability. Additionally, we provide the R package, icc, to facilitate analyses and determine the most economic use of resources when planning experiments to estimate repeatability.


Introduction

Repeatability is the fraction of total variation in a set of measurements that is attributable to variance among individuals. Estimates of repeatability are useful in a variety of ecological and evolutionary fields and can be obtained from repeated measurements on a number of individuals (Lessells & Boag 1987). In physiological and behavioural ecology, repeatability is often used to measure the extent to which an individual’s performance or behaviour is canalized (Bennett 1987; Lessells & Boag 1987; Boake 1989). Researchers interested in the evolution of traits often use estimates of repeatability as a rough upper limit to the heritability (Boake 1989; Falconer & Mackay 1996; but see Dohm 2002). This is particularly useful in areas such as behavioural ecology, where the inability to obtain accurate heritability estimates under natural conditions has impeded studies of behavioural trait evolution (Boake 1989). In many areas of ecology and evolutionary biology, researchers seek to increase the power of their experiments and the accuracy of their conclusions by minimizing measurement error (Bailey & Byrnes 1990; Yezerinac, Lougheed, & Handford 1992). When considering static or fixed specimens, the proportion of total variation attributable to measurement error is simply one minus the repeatability. In these cases, repeatability provides a standard means of quantifying measurement error for comparisons among traits and treatment groups.

Estimating repeatability proceeds with k measurements of a trait on each of n individuals. The collected data can then be analysed using a one-way analysis of variance to obtain the variance components (i.e. among-individual variance and within-individual variance – the latter being composed of both error variance and real trait variation within individuals among their k measurements) that are used to estimate the repeatability (Lessells & Boag 1987). Much like other statistics (e.g. mean and standard deviation), the estimated repeatability is merely a sample statistic that estimates a population parameter, which in this case is the true repeatability (Sokal & Rohlf 1995). This raises the question, ‘how many individuals (n) and how many times does each individual need to be measured (k) to obtain a precise estimate of repeatability?’
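
To make the calculation concrete, the following R code is a minimal sketch of this one-way anova estimation for a small simulated data set, following the variance-component approach of Lessells & Boag (1987); the sample sizes and variance values are arbitrary, and the code is independent of the icc package.

set.seed(1)
n <- 20; k <- 3                                      # hypothetical sample sizes
ind <- factor(rep(seq_len(n), each = k))             # individual identities
y <- rnorm(n, sd = 1)[ind] + rnorm(n * k, sd = 0.5)  # among- plus within-individual variation
ms <- anova(aov(y ~ ind))[["Mean Sq"]]               # mean squares: among, within
var_among <- (ms[1] - ms[2]) / k                     # among-individual variance component
var_within <- ms[2]                                  # within-individual variance component
var_among / (var_among + var_within)                 # repeatability (ICC)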

In the existing ecological, behavioural and evolutionary literature, there is no established guideline for determining the sample size (n and k) required to estimate repeatability. This is apparent from the inconsistency of sample sizes used in repeatability estimates across papers, coupled with a lack of justification for the particular sample sizes chosen. For example, in a review of repeatability estimates for behavioural traits, Bell, Hankison, & Laskowski (2009) found repeatabilities based on measurements of as few as five and as many as 1318 individuals, with most individuals measured twice, but ranging up to 60 separate times (mean k = 4·41). Although most papers using repeatability cite Lessells & Boag (1987) to indicate how the repeatability calculation was carried out, this instructive paper does not discuss the sampling error of the statistic and hence does not provide guidelines for determining the sample size required to obtain precise estimates. Of course, defining the appropriate level of precision depends upon the question(s) one is trying to answer with the statistic. When assessing measurement error, researchers are seeking to determine the external variance introduced by the act of measurement, and so strive to obtain an estimate with as narrow a confidence interval as possible. In other applications, such as estimating the repeatability of a particular behavioural response (e.g. thermal preference of Drosophila subobscura; Rego et al. 2010), the aim is to distinguish the repeatability estimate from some expected value (e.g. zero in the case of demonstrating significant among-individual variation). In the latter case, the precision with which repeatability must be estimated is set by the distance between the estimate and the value with which it is compared. It is possible to test whether or not a particular repeatability estimate differs from some a priori chosen value (Donner & Eliasziw 1987; Walter, Eliasziw, & Donner 1998), and researchers often use these methods to test the hypothesis that their repeatability estimates are significantly >0. In some contexts, this may be sufficient, but to convey the precision of the estimate, it is more informative and appropriate to report the magnitude of the repeatability with an associated confidence interval (Nakagawa & Cuthill 2007). Little attention has been paid to the precision of repeatability estimates in the ecological, behavioural and evolutionary literature, and confidence intervals are rarely reported. However, this issue has received considerable attention in the medical literature (Donner & Eliasziw 1987; Walter, Eliasziw, & Donner 1998; Bonett 2002), and one of our goals in this study is to introduce the methods derived therein to the behavioural, ecological and evolutionary research community.

We begin by sampling recently published papers in the behavioural, ecological and evolutionary literature to assess the precision of existing estimates of repeatability and the frequency with which confidence intervals are currently reported. From this survey, it is apparent that guidelines are needed to instruct researchers how to design protocols for assessing repeatability with precision. With that goal, we introduce an equation for sample size estimation from Bonett (2002) and use simulations to test the validity of this method for data with diverse variance structures and over a wide range of intraclass correlation coefficients (ICCs). In general, we find that the prescriptive equation from Bonett (2002) agrees well with results from our simulated data. Finally, we conclude the paper by discussing strategies to maximize the precision of estimates while minimizing the total sample size and recommendations for the future reporting of repeatability estimates. In addition, we have created an R software package, icc, based upon the results and methods discussed in this paper. The package provides a set of tools that researchers can use to design an experiment that will maximize precision of repeatability estimation while minimizing effort.

Materials and methods

Repeatability, as a parametric statistic, is more broadly known as the intraclass correlation coefficient (Sokal & Rohlf 1995),

$$\mathrm{ICC} = \frac{\sigma_A^2}{\sigma_A^2 + \sigma_W^2} \qquad \text{(eqn 1)}$$

where $\sigma_A^2$ is the variance among groups or classes and $\sigma_W^2$ is the variance within groups or classes. The ICC can be estimated from two or more measurements per individual on a number of individuals, or it can be estimated from measurements of a number of individuals per family for a number of family groups, where the individuals in the family make up the repeated measures for the group (e.g. full-sib heritability estimates in evolutionary genetics). The definition of the intraclass correlation coefficient states that the variable measured repeatedly is of the same class. By class, it is meant that observations within groups or individuals are arbitrarily ordered and uncategorized. Similarity within a class and independence of measurements between classes, such that the order has no effect on the estimate of the ICC, distinguish this statistic from the interclass correlation coefficient (i.e. Pearson’s r). Although there are other methods to assess repeatability (e.g. Mansour, Nordheim, & Rutledge 1981; Nussey, Wilson, & Brommer 2007; Dingemanse et al. 2009; Pierotti, Martín-Fernández, & Seehausen 2009), in the present paper, we restrict our discussion to the ICC, which is by far the most widely used statistic for this purpose. Our goal is to derive practical guidelines for estimating the ICC with maximum precision and minimum experimental effort.

Literature survey

We used the ISI Web of Science to obtain the total number of articles published in Behavioral Ecology, Ecology, Evolution, Journal of Experimental Biology and Biological Journal of the Linnean Society for 2008–2009. We chose these journals to represent a wide spectrum of research in evolutionary biology and ecology, including behavioural ecology. Our intent was not to provide a comprehensive survey of the literature, but rather to provide a good snapshot of how repeatability is being used in current practice by choosing one major journal in each area. We used Google Scholar to search the above journals for articles published in 2008–2009 that contain the words ‘repeatability’, ‘intraclass correlation coefficient’ or ‘measurement error’. To simplify our analysis, we only took estimates of repeatability and per cent measurement error from papers that used a one-way anova to obtain the variance components used in estimating these statistics. All per cent measurement errors were converted to repeatabilities by subtracting the proportion measurement error from one. We classified the estimates of ICC as follows. If the ICC was estimated to assess the precision of repeated measures of fixed or static specimens (i.e. what we typically think of as measurement error), we classified it as ‘measurement error’. In these cases, differences between measures of the same individual are assumed to be entirely caused by measurement error, not by any changes in the specimens themselves. However, if the ICC represented the repeatability of a trait among individual specimens (e.g. among siblings within a family) or variation within individuals over time (e.g. successive assays of individual behaviour), then these estimates were classified according to the type of trait measured (i.e. ‘morphological’, ‘behavioural’ or ‘physiological’).

For each estimate of the ICC, we obtained the n, k, F statistic and degrees of freedom from the anova, and the confidence limits, if reported. If confidence limits were not reported, but the F statistic, degrees of freedom and k were, we calculated the upper and lower 95% confidence limits using eqn 1 from Bonett (2002). If neither confidence limits nor the F statistic with degrees of freedom was reported, but the standard error of the ICC estimate was, we calculated the width of the 95% confidence interval as $2 z_{\alpha/2}\,\mathrm{SE}(\hat{\rho})$, where $z_{\alpha/2}$ is the point on a standard normal distribution exceeded with probability $\alpha/2$ (i.e. $z_{\alpha/2} =$ 1·96 for a 95% confidence interval) and $\hat{\rho}$ is the ICC estimator.
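
As an illustration, the following R sketch recovers such limits, assuming the exact limits of eqn 1 in Bonett (2002) correspond to the standard one-way anova confidence limits for the ICC; the input values in the example calls are hypothetical.

# 95% confidence limits for a published ICC from the reported F statistic,
# assuming a balanced one-way anova with n individuals and k measures each
icc_ci_from_F <- function(Fobs, n, k, alpha = 0.05) {
  Fl <- Fobs / qf(1 - alpha/2, n - 1, n * (k - 1))
  Fu <- Fobs * qf(1 - alpha/2, n * (k - 1), n - 1)
  c(lower = (Fl - 1) / (Fl + k - 1), upper = (Fu - 1) / (Fu + k - 1))
}

# Approximate 95% confidence interval width from a reported standard error
ciw_from_se <- function(se, alpha = 0.05) 2 * qnorm(1 - alpha/2) * se

icc_ci_from_F(Fobs = 5.2, n = 30, k = 2)   # hypothetical published values
ciw_from_se(se = 0.08)                     # hypothetical standard error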

Sample size required for estimating the ICC

Experiments in some fields are less focused on testing a particular hypothesis (e.g. whether a behaviour has a heritable basis) and are more focused on estimating specific biological parameters (e.g. using estimates of heritability to predict the response to selection using the breeders’ equation; Lynch & Walsh 1998). In the latter case, researchers are concerned with minimizing the uncertainty of their parameter estimates. Previous work, particularly in the evolutionary genetics literature, has discussed sample size estimation for the ICC, but these studies deal with the power necessary for hypothesis testing rather than minimizing the confidence interval around estimates of the ICC (Robertson 1959; Lynch & Walsh 1998). Bonett (2002) provides a prescriptive equation that allows researchers to determine the number of individuals (n) that will yield a confidence interval of a fixed width (w) for a given ICC ($\hat{\rho}$) with a given number of measurements per individual (k):

$$n = \frac{8 z_{\alpha/2}^2 (1 - \hat{\rho})^2 \left[1 + (k - 1)\hat{\rho}\right]^2}{k (k - 1) w^2} + 1 \qquad \text{(eqn 2)}$$

Here, w is approximated by $2 z_{\alpha/2} \sqrt{\mathrm{var}(\hat{\rho})}$, and the variance of the ICC estimator, $\mathrm{var}(\hat{\rho})$, is approximated by (Fisher 1954):

$$\mathrm{var}(\hat{\rho}) \approx \frac{2 (1 - \hat{\rho})^2 \left[1 + (k - 1)\hat{\rho}\right]^2}{k (k - 1)(n - 1)} \qquad \text{(eqn 3)}$$

Equation 2 can be employed a priori to design protocols for estimating the ICC with a desired level of confidence.
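
A stand-alone sketch of eqn 2 in R is given below (the ‘Nest’ function in the icc package is built on the same expression); the parameter values in the example calls are arbitrary.

# n individuals needed for a confidence interval of width w, given an
# anticipated ICC (rho) and k measurements per individual (eqn 2)
n_bonett <- function(rho, k, w, alpha = 0.05) {
  z <- qnorm(1 - alpha/2)
  ceiling(8 * z^2 * (1 - rho)^2 * (1 + (k - 1) * rho)^2 / (k * (k - 1) * w^2) + 1)
}

n_bonett(rho = 0.8, k = 2, w = 0.2)   # relatively few individuals needed at high repeatability
n_bonett(rho = 0.2, k = 2, w = 0.2)   # many more needed at low repeatability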

Thus far, we have only dealt with ICCs estimated from a one-way anova, where the n individuals and the k measurements are each randomly sampled from a population of values. As a side note, we point out that the ICC can also be calculated from a two-way random-effect anova or a two-way mixed-effect anova (McGraw & Wong 1996). Equation 2 can also be applied in these latter two cases (Bonett 2002).

To determine whether application of eqn 2 yields the expected confidence intervals for data with diverse variance structures and over a wide range of ICCs, we simulated ICCs by Monte Carlo sampling. Sampling in iteration i (where i = 1, 2, …, 10 000) proceeded such that each unique combination of parameters c [expected ICC (0·2, 0·4, 0·6, 0·8, 0·9 or 0·95), k measurements per individual (2, 3, …, 10) and desired confidence interval width (95% CIW = 0·1, 0·2 or 0·3)] was entered into eqn 2 to yield the sample size estimate $n_{i,c}$. We then simulated a data set of $n_{i,c}$ individuals, each with $k_{i,c}$ measurements. For an iteration of any combination of parameters c, the $k_{i,c}$ measurements for a particular individual were drawn from a random normal distribution with constant variance, centred on that individual’s true value. The $n_{i,c}$ individual true values were themselves drawn from a random normal distribution whose variance was calculated by rearranging eqn 1 to solve for the among-group variance, $\sigma_A^2$ (where the ICC is the expected value entered into eqn 2 and the within-group variance, $\sigma_W^2$, is the constant variance used to generate the $k_{i,c}$ measurements).

Therefore, any combination c utilized $n_{i,c}$ individuals, each with $k_{i,c}$ unique values. We then calculated the ICC from the simulated data using a one-way anova to estimate the variance components (using function ‘ICCest’ in the icc package) and repeated the process 9999 times. This yielded 10 000 simulated ICC estimates for every unique combination (c) of k, w and ICC. Using each of the c distributions of ICCs, we estimated the realized 95% confidence interval. We could then estimate the proportional difference, introduced by the assumptions in eqn 2, by comparing the realized CIWs with the expected CIWs (i.e. the desired CIW entered into eqn 2) as follows: proportional difference = (realized CIW − expected CIW)/expected CIW.
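
The following condensed R sketch illustrates the procedure for a single parameter combination, using far fewer iterations than the full study and a plain one-way anova in place of the package’s ‘ICCbare’ function; all parameter values shown are purely illustrative.

# One parameter combination of the Monte Carlo check (illustrative scale only)
rho <- 0.6; k <- 4; w <- 0.2; nsim <- 1000
z <- qnorm(0.975)
n <- ceiling(8 * z^2 * (1 - rho)^2 * (1 + (k - 1) * rho)^2 / (k * (k - 1) * w^2) + 1)  # eqn 2
var_w <- 1                              # within-individual variance
var_a <- rho * var_w / (1 - rho)        # among-individual variance, from rearranging eqn 1

sim_icc <- replicate(nsim, {
  ind <- factor(rep(seq_len(n), each = k))
  y <- rnorm(n, sd = sqrt(var_a))[ind] + rnorm(n * k, sd = sqrt(var_w))
  ms <- anova(aov(y ~ ind))[["Mean Sq"]]
  ((ms[1] - ms[2]) / k) / (((ms[1] - ms[2]) / k) + ms[2])
})

realized_ciw <- diff(quantile(sim_icc, c(0.025, 0.975)))   # realized 95% CIW
(realized_ciw - w) / w                                     # proportional difference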

To study the optimal sampling design given a fixed researcher effort, we algebraically manipulated eqn 1 to solve for the F statistic, at a given ICC and k (for a derivation see Appendix 2 in Lessells & Boag 1987), to be used in the construction of exact confidence limits according to eqn 1 of Bonett (2002). We then solved for the upper and lower confidence limits at three ICCs (0·1, 0·5 and 0·9) and three levels of fixed researcher effort (30, 60 and 120 total measurements), restricting the analyses to combinations of n and k where n > 3. All simulations and analyses were conducted in the statistical software R (R Development Core Team 2011).
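
The core of that calculation can be sketched in a few lines of R, again assuming the exact limits of eqn 1 in Bonett (2002) are the standard one-way anova limits for the ICC; the ‘effort’ function in the icc package produces the corresponding plots.

# Confidence interval width as a function of k for a fixed total effort (n * k)
ciw_fixed_effort <- function(icc, effort, alpha = 0.05) {
  k <- 2:floor(effort / 4)                  # keep n = floor(effort/k) > 3
  n <- floor(effort / k)
  Fobs <- (1 + (k - 1) * icc) / (1 - icc)   # F implied by the ICC and k (Lessells & Boag 1987)
  Fl <- Fobs / qf(1 - alpha/2, n - 1, n * (k - 1))
  Fu <- Fobs * qf(1 - alpha/2, n * (k - 1), n - 1)
  data.frame(k, n, ciw = (Fu - 1) / (Fu + k - 1) - (Fl - 1) / (Fl + k - 1))
}

ciw_fixed_effort(icc = 0.5, effort = 60)    # the k minimising ciw is the optimal k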

icc – An R package for researchers estimating repeatability

We created the R package icc to facilitate the simulations and analyses conducted in this work, as well as to aid researchers in planning experiments to estimate the ICC with maximum precision. Additionally, the package contains a function to estimate repeatability from a one-way anova. The project web page can be viewed on R-Forge. Further, the package can be downloaded and run in R using the following command:

install.packages("ICC", repos = getOption("repos"))

and selecting the closest CRAN mirror site. Alternatively, the package can be downloaded from R-Forge using the following command in R (note: to install a package from R-Forge, the most recent release of R must be installed):

install.packages("ICC", repos = "http://R-Forge.R-project.org")

The package icc contains the following functions (for more detailed descriptions, see the R documentation for the package; a brief usage sketch follows the list):

  •  ICCest – Estimates the ICC, confidence intervals, n, k and variance components using a one-way anova.
  •  Nest – Given a predicted ICC and k measures per individual/group, calculates the n individuals/groups required to obtain a desired confidence interval width (Bonett 2002). Inputs can either be vectors of hypothetical values or a pilot data set, from which input values will be calculated and implemented.
  •  ICCbare – Estimates only the ICC from a one-way anova. The function and calculation of the ICC have been made as efficient as possible to reduce computation time in Monte Carlo simulations or bootstraps.
  •  effort – Plots the width of the confidence interval against the k measures per individual/group for a fixed total researcher effort. The minimum of this curve is the optimal k for that effort.
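
A hypothetical usage sketch is given below. The function names are those listed above, but the argument names shown are our assumptions and may differ from the package documentation (see ?ICCest and ?Nest after installation).

library(ICC)

# Estimate the ICC from grouped data: 'ind' is the grouping factor and 'y'
# holds the repeated measurements (simulated here for illustration)
dat <- data.frame(ind = factor(rep(1:20, each = 3)),
                  y = rnorm(20)[rep(1:20, each = 3)] + rnorm(60, sd = 0.5))
ICCest(ind, y, data = dat)

# How many individuals are needed for a CI width of 0.2 if we anticipate
# an ICC near 0.6 and plan k = 4 measurements per individual?
Nest("hypothetical", w = 0.2, ICC = 0.6, k = 4)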


Results

Literature survey

We found 48 papers in which the ICC was used to estimate measurement error or the repeatability of morphological, behavioural, physiological or life-history traits (Table 1, Fig. 1). Only two papers reported repeatability for life-history traits (three traits), and so we did not include this category in Fig. 1. In general, estimates of the ICC used to assess measurement error are closer to one (mean = 0·91, median = 0·96, range = 0·47–1, number of estimates = 53; Fig. 1) than estimates of trait repeatabilities, which are more evenly distributed throughout the range of possible ICC values. Morphological traits tend to show the highest ICCs relative to other trait types (mean = 0·65, median = 0·68, range = 0·18–0·98, number of estimates = 19), whereas physiological and behavioural traits are somewhat lower and more dispersed (physiological: mean = 0·32, median = 0·30, range = −0·18–0·95, number of estimates = 10; behavioural: mean = 0·48, median = 0·48, range = −0·2–1, number of estimates = 174). The estimates used in Fig. 1 represent 46 papers that reported ICCs (excluding the two papers that estimate the repeatability of life-history traits), including multiple estimates per paper. To ensure that the observed trends were not influenced by pseudoreplication (using multiple estimates per paper), we repeated this analysis using only the minimum estimate for each paper and again using only the maximum estimate per paper (results not shown). These more restricted analyses yielded the same conclusions, and so we include only the full data set here.

Table 1. Frequency of research articles reporting repeatability estimated from the intraclass correlation coefficient (ICC) in 2008 or 2009, for select journals.

Journal                                       Total articles   Number reporting ICC   Per cent reporting ICC
Behavioral Ecology                            365              21                     5·8%
Ecology                                       726              2                      0·3%
Evolution                                     531              14                     2·6%
Journal of Experimental Biology               1203             5                      0·4%
Biological Journal of the Linnean Society     431              6                      1·4%
Total                                         3256             48                     1·5%

Figure 1. A frequency distribution of repeatabilities reported in research articles published in 2008–09, separated by trait type (see text for explanation). The x-axis shows repeatability estimated as the intraclass correlation coefficient (ICC).

Of the 48 papers that included estimates of the ICC, only two reported confidence limits about the ICC estimates. From an additional 24 papers, we were able to estimate the 95% confidence limits using the reported F statistic and degrees of freedom. Additionally, we were able to estimate approximate 95% confidence limits from another two papers that reported the standard errors for their ICC estimates. The resulting distribution of confidence intervals is shown in Fig. 2, binned according to the estimated ICC. The figure illustrates that for ICCs greater than about 0·3, the 95% confidence intervals rarely include zero, indicating repeatabilities statistically significantly different from zero. Nevertheless, the confidence intervals are often large. For example, the median confidence interval width is approximately 0·3 for ICCs of 0·2 and 0·3, and overall 12% of estimates had confidence intervals >0·5. This clearly indicates that many published estimates of the ICC are not very precise. Using either the minimum or the maximum estimates per paper produced similar results (not shown).

Figure 2. The relationship between repeatability (estimated as the intraclass correlation coefficient – ICC) and the precision with which each estimate is made. Precision is indicated by the width of the 95% confidence interval, shown on the y-axis. Lower values signify greater precision. The horizontal bar across each box indicates the median 95% confidence interval width while the box itself illustrates the inter-quartile range (with the whiskers representing 1·5× the inter-quartile range and outliers depicted as open circles). The solid and dashed lines indicate where the confidence interval may include zero (points above and to the left of the solid line) or one (points above and to the right of the dashed line), respectively. Numbers along the top indicate the number of estimates (sample size) for each box.

Sample size estimation

We implemented eqn 2 over a range of combinations of CIW and number of measurements per individual (k) to determine the sample size (n and k) required to estimate the ICC with specific levels of precision. These results are tabulated for convenient reference in Appendix Table 1 and can be generated using the ‘Nest’ function in the icc package. The general shape of the relationship between n, k and ICC is illustrated in Fig. 3 for a confidence interval width of 0·2. Figure 3 also shows the sample sizes (n and k) used in the 48 papers that reported ICC estimates. Points below the line of corresponding repeatability indicate sample sizes that would result in confidence intervals >0·2. The majority of points cluster within the range of n < 60 and k = 2, which is sufficient to yield a confidence interval of <0·2 only for ICCs of 0·9 or more. These comparisons suggest that the sample sizes used may frequently produce imprecise estimates of the ICC. However, the opposite is also evident: points far above the line indicate sample sizes that may be larger than necessary and hence could reflect wasted effort. Again, the reason for which the repeatability is being estimated will guide the decision about what is or is not a sufficient sample size. From Appendix Table 1, we can see that the sample sizes typically used (n < 60 and k = 2) would yield confidence intervals of at least 0·2 for ICC = 0·8, 0·3 for ICC = 0·7 and 0·4 for ICC = 0·5, reinforcing the earlier message that ICCs are being estimated with considerable uncertainty. As a general rule, the sample size needed for a given confidence interval decreases as the ICC increases, as does the influence of the number of measurements per individual (k). If the ICC is ≥0·9, there is little to be gained by measuring each individual more than twice, and for an ICC of 0·8, k appears to level out at three. At the opposite extreme, for an ICC of 0·2, precision continues to increase at least up to 10 measurements per individual. This makes intuitive sense: if the ICC is low, a high proportion of the sample variance is owing to variance within individuals (e.g. uncanalized behavioural responses) and more measurements per individual are required to estimate this variance.

Figure 3. The sample size (n, k) necessary to estimate the intraclass correlation coefficient (ICC) with a 95% confidence interval width of 0·2. Open circles depict the actual sample sizes from published estimates of the ICC in 2008 and 2009, with the colour of each point corresponding to the line with the closest repeatability.

The preceding guidelines resulting from the interpretation of Figs. 2 and 3 assume that no constraints act to limit the total number of measurements that can be obtained. In practice, constraints on total researcher effort (n × k) often restrict the sampling design for estimating repeatabilities. We therefore estimated the confidence interval widths for all possible combinations of n and k at three levels of fixed researcher effort at each of three ICCs (Fig. 4). For a fixed level of effort, there is clearly an optimal k, a number of measurements per individual that produces the smallest confidence interval around the ICC estimate. The optimal k declines as the ICC increases, as we would predict from our previous results (comparing across panels a, b and c of Fig. 4). Although deviating from the optimum k in either direction increases the confidence interval, measuring fewer times than the optimum seems to have a slightly more detrimental effect than measuring more, especially when the ICC is low. The total effort has little effect on the optimal k but strongly influences the width of the confidence interval, especially for low ICCs (compare different line types within a panel of Fig. 4).

Figure 4. Confidence interval widths (CIW) as a function of the number of measurements per individual (k) for three levels of total research effort. Results are plotted for intraclass correlation coefficient (ICC) values of 0·1 (a), 0·5 (b) and 0·9 (c). Effort is defined as n × k, where n is the number of individuals measured, and the levels shown are 30 (dashed lines), 60 (solid lines) or 120 (dotted lines). Similar plots can be reconstructed, for many different parameter combinations, by implementing the ‘effort’ function in the icc package.

In general, eqn 2 provides a reliable method for sample size estimation over most values of the ICC. However, the proportional difference between the realized and expected confidence intervals about our simulated ICCs increased at the extremes of the ICC (Fig. 5). Simulations from the prescribed n yielded confidence intervals that were larger than expected when the ICC was high, especially for high values of k. Conversely, the confidence intervals were smaller than expected when the ICC was low, especially for low values of k. As a consequence of the two assumptions in eqn 2, the width of the confidence interval tends to be underestimated by the equation when repeatability is high and overestimated when repeatability is low.

Figure 5. The proportional difference of realized confidence interval widths (CIW) from expected CIWs [(simulated CIW − expected CIW)/expected CIW] for different combinations of intraclass correlation coefficient (ICC), k measurements per individual and expected CIW. The red, horizontal line indicates no difference between the simulated and expected CIW. Simulated confidence intervals were constructed by selecting the 250th (2·5%) and 9750th (97·5%) largest values of the 10 000 simulated ICCs for each combination of parameters.


Discussion

Our review of selected journals in ecology, behaviour and evolutionary biology indicated that the sample sizes (n individuals or groups and k measurements per individual or group) being used to estimate the intraclass correlation coefficient (ICC) are not derived from any formal guidelines, and in many cases they do not appear adequate to obtain precise estimates. The equation developed by Bonett (2002) allows researchers to estimate the sample size (n and k) required to estimate repeatability with precision, but use of this method has not penetrated the ecological, behavioural and evolutionary literature. To facilitate its implementation, we have created an R package, icc, that estimates repeatabilities and their confidence intervals (‘ICCest’ function), as well as several functions that aid researchers in designing experiments to estimate repeatability with minimum effort. Equation 2, Appendix Table 1 and Figs. 3 and 4 provide analytical, tabular and visual guides for sample size estimation. As a simple guide for experimental design, increasing n at high repeatabilities and increasing k at low repeatabilities are the most efficient ways to decrease the size of the confidence interval about a repeatability estimate. In general, the number of measurements required per individual (k) increases as the expected ICC declines, ranging from an optimal k of two for ICCs around 0·9 to between six and eight for ICCs of 0·1. Confidence intervals are seldom reported for repeatability estimates, but we urge authors to estimate these and report them with their repeatability estimates in the future. Methods for deriving the confidence limits about repeatabilities can be found in Donner & Wells (1986), Becker (1992), McGraw & Wong (1996) and Kistner & Muller (2004). The methods of most use to a general audience have been included in the icc package.

It is not surprising to find disagreement, at the extremes of ICC estimates, between our realized confidence intervals and the expectations from eqn 2. To better understand the discrepancy in confidence interval width, we must first understand the effect of sample size (n and k) on the ICC estimate. Ratios of statistics, where the numerator and denominator are each sampled with error, are often biased relative to the true population value (Lynch & Walsh 1998). As Ponzoni & James (1978) state, ‘the expectation of a ratio does not equal the ratio of expectations’. ICC estimates based on low sample sizes, where the numerator or denominator is estimated with greater error, are therefore biased, which in turn introduces error into the estimate of the confidence interval. The ICC variance approximation used to construct confidence intervals is expected to be unbiased only when n is >30 (Donner & Koval 1983). Ponzoni & James (1978), Wang, Yandell, & Rutledge (1991) and Visscher (1998) discuss the biased estimates of the ICC in the context of heritabilities, where a particular family group is the class. Estimates of heritability are two and four times the ICC (or the intra-family correlation) in full-sib and half-sib breeding designs, respectively. Thus, even an extremely large heritability estimate of 0·9 represents an ICC estimate of only 0·45 in a full-sib design and half this value in a half-sib design. Although these authors provide a correction for bias in ICC estimates, they investigate this only within the range of typical heritabilities. Unfortunately, they do not consider the applicability of their correction to large ICC estimates (i.e. ICCs > 0·8).

Broadly speaking, the intermediate values in the distribution of our simulated ICC estimates yielded confidence intervals similar to those approximated using eqn 3. However, at higher ICCs, the simulated ICC distributions tended to give larger confidence intervals than expected, while estimates at the lower end of the ICC and k values investigated gave smaller confidence intervals than expected. This has practical consequences when implementing eqn 2 in the design of experiments, the most serious of which occur when the CIW is underestimated by the equation (i.e. at higher ICCs). If sample sizes for a study are chosen a priori based on eqn 2, researchers may find their actual estimate to have a larger confidence interval than intended. In the interest of both parameter estimation and hypothesis testing, it is therefore important to increase the sample size (mainly n) accordingly. At the other extreme (lower ICCs), the predictive equation often prescribes an n that will yield an ICC estimate with a confidence interval smaller than expected. At these lower ICCs, confidence intervals simulated with optimal k values tend to be only slightly smaller than the expected CIW; however, this difference grows (realized CIWs become smaller than expected) as k decreases away from the optimal value.

Our results have demonstrated that for a given ICC, the same confidence interval can be derived from several different combinations of n and k. Choosing which level of k to use in combination with the resulting value of n is often a function of the constraints of the experimental system. For example, the number of subjects that can be measured (n) is often limited in studies requiring invasive methods or measurement of traits that are subject to ethical restrictions (e.g. infanticide in voles; Poikonen et al. 2008). Under that circumstance, researchers may opt for the combination of n and k that minimizes n. Conversely, in behavioural experiments, the test subjects may become over- or desensitized to the testing stimuli if the test is repeated too often, which can limit the number of trials (k) in a given experimental design. Because of this, Bell, Hankison, & Laskowski (2009) recommend designing behavioural experiments to measure more individuals (n), with fewer measurements per individual (k). Thus, behavioural researchers would likely opt for the combination of n and k that minimizes k. It is important to note the distinction between our recommendation and that of Bell et al. with respect to the value of k to use. Bell et al. tested for a bias in the reported repeatability caused by organisms becoming habituated or sensitized to the behavioural assays (Martin & Reale 2008) by regressing the repeatability estimate (actually the effect size in Bell et al.) on the k measures per individual. Because they found no effect of k on the reported values of repeatabilities, they reasoned that the total effort should be devoted to increasing the number of individuals measured (n) rather than the number of measurements per individual (k). However, our findings predict an effect of k on the confidence interval. Therefore, when reducing k, researchers should consult eqn 2 and our tables and graphs to determine the n required to estimate repeatability with a desired level of confidence.


Acknowledgements

We thank D.A. Roff for his great patience and suggestions throughout discussions of this paper; however, any faults in this paper are our own. Similarly, many thanks go to K.M. Middleton for helpful suggestions about creating R packages. Additionally, we thank T.F. Hansen and two anonymous reviewers for providing comments that greatly improved the manuscript. This work was supported by the National Science Foundation through a grant (DEB-0743166) to D. J. F. and a Graduate Research Fellowship Position to M. E. W.


Appendix Table 1. The n individuals required to estimate the intraclass correlation coefficient (ICC) with a 95% confidence interval width (CIW), for k observations per individual. Estimates are obtained from the ‘Nest’ function in the icc package (which is based upon eqn 2 in the text).

Entries give n, the number of individuals measured.

CIW     ICC     k = 2   k = 3   k = 4   k = 5   k = 6   k = 7   k = 8   k = 9   k = 10  k = 11  k = 12  k = 13  k = 14
0·1     0·1     1508    599     352     245     188     153     130     114     101     92      85      79      74
0·2     0·1     378     151     89      62      48      39      34      30      26      24      22      21      20
0·3     0·1     169     68      40      29      22      18      16      14      13      12      11      10      10
0·4     0·1     96      39      23      17      13      11      10      9       8       7       7       6       6
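
As a check, the tabulated values can be reproduced directly from eqn 2; for example, the first row above (CIW = 0·1, ICC = 0·1) is recovered by the following R lines.

z <- qnorm(0.975)               # z for a 95% confidence interval
rho <- 0.1; w <- 0.1; k <- 2:14 # first row of the table above
ceiling(8 * z^2 * (1 - rho)^2 * (1 + (k - 1) * rho)^2 / (k * (k - 1) * w^2) + 1)
# 1508  599  352  245  188  153  130  114  101   92   85   79   74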