Richard F. Preziosi, Faculty of Life Sciences, 3.614 Stopford Building, The University of Manchester, Oxford Road, Manchester M13 9PT, UK. Tel.: +44 (0)161 2755959; fax: +44 (0)161 2753938; e-mail: email@example.com
Advanced techniques for quantitative genetic parameter estimation may not always be necessary to answer broad genetic questions. However, simpler methods are often biased, and the extent of this determines their usefulness. In this study we compare family mean correlations to least squares and restricted error maximum likelihood (REML) variance component approaches to estimating cross-environment genetic correlations. We analysed empirical data from studies where both types of estimates were made, and from studies in our own laboratories. We found that the agreement between estimates was better when full-sib rather than half-sib estimates of cross-environment genetic correlations were used and when mean family size increased. We also note biases in REML estimation that may be especially important when testing to see if correlations differ from 0 or 1. We conclude that correlations calculated from family means can be used to test for the presence of genetic correlations across environments, which is sufficient for some research questions. Variance component approaches should be used when parameter estimation is the objective, or if the goal is anything other than determining broad patterns.
The estimation of genetic parameters such as heritabilities and correlations is central to our understanding of evolution. While predicting evolutionary change is a common goal in both theoretical and applied studies, it is rare that information about both selection and genetics is available for natural populations (Grant & Grant, 1995). The logistically complex designs and advanced analytical methods used in the estimation of genetic parameters are often necessary to achieve even reasonable predictions of multivariate evolution. In contrast, for many evolutionary studies, of laboratory or natural populations, it is of interest to simply determine the sign of a genetic correlation between traits. However, concerns about the theoretical assumptions and the complexity of analyses often discourage nonspecialists from pursuing quantitative genetic studies. In certain circumstances, widely used or practical substitutions for the theoretically more accurate solution have been shown to be informative. This is especially true if researchers are interested in comparing broad patterns rather than precise parameter estimation (Arnold, 1994). For example, heritability estimates obtained in laboratory studies are often considered inflated relative to heritability in nature due to differences in environmental heterogeneity (Weigensberg & Roff, 1996; see review by Hoffmann & Merilä, 1999). While field estimates are preferable to laboratory estimates, the former are not always practical. Weigensberg & Roff (1996) show that laboratory estimates of heritability can be reasonable approximations of heritability in the wild. Another example is the determination of genetic correlations among traits. In circumstances where genetic data are unavailable or impractical to collect, Cheverud (1988) suggests that it is possible to use phenotypic correlations as approximations to genetic correlations to make evolutionary inferences.
However, estimates of genetic and phenotypic correlation may differ due to the effects of heritability, environment, the type of traits considered, and sampling (Cheverud, 1988; Willis et al., 1991). Yet analyses of empirical data suggest that, with certain qualifications, phenotypic correlations do approximate genetic correlations (Cheverud, 1988; Roff, 1995; Lynch, 1999). The issue is not whether one estimate is superior to another, but rather what are the circumstances under which simpler substitutes are permissible?
Calculation of cross-environment genetic correlations is an example where a practical estimation method may be used over a theoretically more robust, but less accessible method. Cross-environment genetic correlations have important implications for evolutionary studies. Falconer (1952) was the first to suggest that a trait measured in two environments can be considered as genetically correlated traits or ‘character states’ (sensu Via & Lande, 1985). When a population is still evolving in either environment (i.e. is suboptimal) then the genetic correlation between environments (character states) will influence the evolution of the trait in both environments (Via, 1984). Via & Lande (1985) also pointed out that the genetic correlation affects the rate at which a population will evolve to express the optimum character state in all environments. The higher the genetic correlation the slower the approach to the optima (but see Pigliucci, 1996) and the less independent (more constrained) trait evolution in each environment will be. Via & Lande (1985) also showed that it is only when the genetic correlation is equal to 1, or –1, that the correlation in itself prevents optimization in all environments.
A genetic correlation can be estimated either from the correlation among family means or from variance components estimated by anova [least squares (LS) anova or restricted error maximum likelihood (REML)]. Each of these methods has particular weaknesses including bias of estimation (Shaw, 1987; Fry, 1992; Roff & Preziosi, 1994; Windig, 1997; Dutilleul & Carriere, 1998), precision of the estimator (Windig, 1997), power to detect a difference from specific values (Shaw, 1987; Fry, 1992; Roff & Preziosi, 1994; Windig, 1997), and the occurrence of zero value or negative variance components leaving the genetic correlation undefined (Shaw, 1987; Windig, 1997). The performance of different estimates in relation to each of these issues therefore depends on sample size, experimental design, distribution of data within and between experimental factors, and the actual value of the correlation. Given the vagaries of biological data, there is not a single method that is best under all circumstances. It is therefore worthwhile to define the conditions under which the methods are acceptable.
The ease of calculation and the continued frequent use of correlations calculated from family means in published studies make this method very attractive in terms of convenience and comparability. Yet ease of use is not, by itself, a compelling reason to use a method. Theory and evidence from simulations suggest that in some circumstances the family mean correlation can seriously underestimate the genetic correlation estimated by other methods (Rausher, 1984; Via, 1984; Fry, 1992; Carriere et al., 1994; Windig, 1997). Knowledge of how family mean correlations perform against other estimators in general would be useful in determining what, if any, conclusions can be made from them. However, the variety and complex nature of the factors that affect different estimators make it difficult to simulate the entire range of possible data types that may be encountered. A complementary approach is to compare multiple estimates of the genetic correlation that are presented within the same paper and based on the same data. This provides an excellent opportunity to compare family means against other estimates using empirical data. Measures based on empirical data have a variety of distributions, derive from different experimental designs, and differ in sample sizes and in other vagaries that are, by definition, encountered within experimental situations. Unlike comparisons based on simulated data, comparisons of measures derived from empirical data make no assumption regarding the distributions of the data.
In this study we attempt to define the conditions where the common and easily obtained estimates of across-environment genetic correlations calculated from family means are reasonable approximations of cross-environment genetic correlations calculated from anova methods. We investigated this question using previously unpublished data from our own laboratories and data gleaned from previously published work (Appendix 1). The data from our laboratories consist of a full-sib study of the ladybird Harmonia axyridis and a half-sib study of the cockroach Nauphoeta cinerea. While our original goal was to compare published measures, we included data from our own work to expand the number of independent data sources included in our analysis.
We measured the overall agreement between genetic correlation estimates and the effect of family size on the relationship between family mean correlation and the collected assortment of variance component methods. Available theory leads to the prediction that genetic correlation estimated from family means should be lower than the actual genetic correlation and that different estimates should converge as family size increases (Rausher, 1984; Via, 1984). We compare our findings to these expectations and discuss the utility of calculating genetic correlations from family means to answer different questions.
Materials and methods
Full-sib example: Harmonia axyridis
We obtained ladybirds from a commercial supplier and raised these in the lab for at least one generation before use in experiments. We used a full-sib, split-family design to estimate quantitative genetic parameters for female progeny (Falconer & Mackay, 1997). After larvae hatched we randomly assigned them to a diet treatment and allowed them to feed ad libitum.
The diet treatments both consisted of the aphid Acyrthosiphon pisum but differed in the host plant of the aphid. In the first diet treatment we raised aphids exclusively on sugar snap peas (Pisum sativum var. ‘macrocarpon Dwarf sweet green’; hereafter pea) and in the second diet we raised aphids exclusively on faba beans (Vicia faba var. ‘Windsor White’; hereafter bean). We used the same clone of aphid in both treatments. We calculated full-sib heritabilities (Becker, 1992) and estimated five different genetic correlations for each trait based on family means or different combinations of variance components. The procedure for each is given below.
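Method 1: family mean correlation
We calculated the genetic correlation from family means as:
$$r_g = \frac{\mathrm{Cov}_{E1,E2}}{\sqrt{V_{E1} \times V_{E2}}} \quad (1)$$
where CovE1,E2 is the covariance between the full-sib family means in environments E1 and E2, and VE1 and VE2 are the variances among family means within each environment. We estimated these correlations by calculation of Pearson product-moment correlations of the family means from each environment (Sokal & Rohlf, 1995).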
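The family mean calculation can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the data layout (a dictionary of offspring values per family), the function name and the example values are all assumptions made for the sketch.

```python
from math import sqrt
from statistics import mean

def family_mean_correlation(env1, env2):
    """Pearson correlation of family means across two environments.

    env1, env2: dicts mapping a family id to a list of offspring trait
    values in environments E1 and E2 (hypothetical data layout).
    """
    families = sorted(set(env1) & set(env2))  # families measured in both
    m1 = [mean(env1[f]) for f in families]    # family means in E1
    m2 = [mean(env2[f]) for f in families]    # family means in E2
    g1, g2 = mean(m1), mean(m2)
    k = len(families)
    cov = sum((a - g1) * (b - g2) for a, b in zip(m1, m2)) / (k - 1)
    v1 = sum((a - g1) ** 2 for a in m1) / (k - 1)
    v2 = sum((b - g2) ** 2 for b in m2) / (k - 1)
    return cov / sqrt(v1 * v2)

# Invented example values for three families on two diets
pea = {"A": [5.0, 5.2], "B": [5.6, 5.8], "C": [6.1, 6.3]}
bean = {"A": [4.8, 5.0], "B": [5.1, 5.5], "C": [5.9, 6.1]}
r_fm = family_mean_correlation(pea, bean)
```

Because the variance among family means contains a share of the within-family variance, which shrinks as family size grows, small families are expected to pull this estimate towards zero.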
Method 2: variance components from separate LS anovas (Windig, 1997)
We used the formula suggested by Windig (1997) to calculate the genetic correlation:
We calculated VF, the variance component due to overall family effects, from a two-way anova (Lynch & Walsh, 1998):
where MSF is the family mean squares (MS), MSF*E is the family by environment MS, n is approximately the number of offspring per family per environment calculated using the general formula for n when group sizes are unequal (Sokal & Rohlf, 1995) and E is the number of environments. We calculated VF,E1, VF,E2, the family variance components in environment E1 and E2, using separate one-way anovas of individuals from one environment (as for heritability estimates):
where MSF,Ei is the family MS, MSErr,Ei is the MS associated with the residual variance, and nEi is the mean family size in the ith environment.
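As a hedged sketch of method 2, the functions below turn mean squares into variance components and then into a genetic correlation. The algebraic forms are assumptions inferred from the symbol definitions above and from standard balanced anova expectations: VF = (MSF − MSF*E)/(nE), VF,Ei = (MSF,Ei − MSErr,Ei)/nEi, and rg = VF/√(VF,E1 × VF,E2). Function names and the numerical mean squares are invented for illustration.

```python
from math import sqrt

def family_component_two_way(ms_family, ms_fxe, n, n_env=2):
    """Overall family component; assumed form (MS_F - MS_FxE) / (n * E)."""
    return (ms_family - ms_fxe) / (n * n_env)

def family_component_one_way(ms_family_i, ms_error_i, n_i):
    """Family component in one environment; assumed form (MS_F,Ei - MS_Err,Ei) / n_Ei."""
    return (ms_family_i - ms_error_i) / n_i

def rg_separate(v_f, v_f_e1, v_f_e2):
    """Assumed eqn 2 form: V_F / sqrt(V_F,E1 * V_F,E2).

    Returns None when either environment-specific component is zero or
    negative, because the correlation is then undefined.
    """
    if v_f_e1 <= 0 or v_f_e2 <= 0:
        return None
    return v_f / sqrt(v_f_e1 * v_f_e2)

# Invented mean squares, five offspring per family per environment
v_f = family_component_two_way(ms_family=12.0, ms_fxe=4.0, n=5)
v_e1 = family_component_one_way(ms_family_i=9.0, ms_error_i=3.0, n_i=5)
v_e2 = family_component_one_way(ms_family_i=11.0, ms_error_i=3.5, n_i=5)
r_ls_sep = rg_separate(v_f, v_e1, v_e2)
```

Returning None rather than a number makes the undefined cases explicit, which matters when many trait-by-study estimates are tabulated together.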
Method 3: variance components from two-way anova (Yamada, 1962)
We calculated the genetic correlation using the interaction variance component from a two-way anova. This method is sensitive to both the effect of crossing reaction norms and the interaction that occurs due to differences in variance in the two environments (Lynch & Walsh, 1998). We calculated the genetic correlation by this method as:
where VF is the variance component due to overall family effects calculated as in eqn 3, and VF*E is the variance component of the interaction between family and environment, calculated as:
where MSF*E is the family-by-environment MS, MSErr is the error term MS, and n is the number of offspring per family per environment.
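Method 3 can be sketched in the same way. The forms used here are assumptions consistent with the definitions above: VF*E = (MSF*E − MSErr)/n for the interaction component and rg = VF/(VF + VF*E) for the correlation. The names and numbers are illustrative only.

```python
def interaction_component(ms_fxe, ms_error, n):
    """Family-by-environment component; assumed form (MS_FxE - MS_Err) / n."""
    return (ms_fxe - ms_error) / n

def rg_two_way(v_f, v_fxe):
    """Assumed eqn 5 form: V_F / (V_F + V_FxE); None if the denominator is not positive."""
    denom = v_f + v_fxe
    if denom <= 0:
        return None
    return v_f / denom

# Invented values: V_F = 0.8 and mean squares giving V_FxE = 0.2
v_fxe = interaction_component(ms_fxe=4.0, ms_error=3.0, n=5)
r_ls_com = rg_two_way(0.8, v_fxe)
```

Under this form a least squares interaction component that happens to be negative gives an estimate above 1, whereas an estimator that constrains the component to be non-negative cannot exceed 1.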
Method 4: variance components from separate REML estimates (Shaw, 1987)
We used variance components to calculate the genetic correlation as in eqn 2. In this case the variance components are all estimated from separate REML analyses. We obtained VF from an analysis of data from both environments, and the environment-specific variance components (VF,E1 and VF,E2) from REML analyses of data from environments E1 and E2, respectively.
Method 5: variance components from a single REML analysis
In this method we used eqn 5 to calculate the genetic correlation as in method 3 above. In this case we estimated VF and VF*E from a single REML analysis performed on data from both environments.
Half-sib example: Nauphoeta cinerea
We used cockroaches from a mass colony that has no detectable inbreeding (Corley et al., 2001). In this case we used a half-sib, split-family design to estimate quantitative genetic parameters for female progeny (Falconer & Mackay, 1997). Nymphs were raised at either 20 or 27 °C.
We calculated standard half-sib heritabilities (Becker, 1992) and estimated five different genetic correlations for each trait. Details of how the calculation of the genetic correlation differs between the half-sib and full-sib designs are given below.
Method 1: family mean correlation
We calculated the family mean correlations in the same way as given in method 1 for full-sibs. In this case correlations were based on sire family means, calculated by taking the mean of individuals within dams and then the mean of dams within sires.
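The two-step averaging described above (individuals within dams, then dams within sires) can be sketched as follows. The record layout and identifiers are hypothetical; the point of the example is that the sire mean is a mean of dam means, so dams with many offspring do not dominate.

```python
from statistics import mean

def sire_means(records):
    """Sire family means computed as the mean of dam means.

    records: iterable of (sire_id, dam_id, trait_value) tuples
    (hypothetical layout for one environment).
    """
    by_dam = {}
    for sire, dam, value in records:
        by_dam.setdefault((sire, dam), []).append(value)
    by_sire = {}
    for (sire, _dam), values in by_dam.items():
        by_sire.setdefault(sire, []).append(mean(values))  # dam mean
    return {sire: mean(dam_means) for sire, dam_means in by_sire.items()}

# Invented records: dam D1 has two offspring, dam D2 has one
records = [("S1", "D1", 10.0), ("S1", "D1", 12.0), ("S1", "D2", 20.0)]
means_by_sire = sire_means(records)
```

In this invented example the sire mean is 15.5 (the mean of dam means 11 and 20), not the pooled offspring mean of 14.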
Method 2: variance components from separate LS anovas (Via, 1984)
We used eqn 2 to calculate the genetic correlation as in method 2 for full-sibs. In the case of a half-sib study, the family variance components in eqn 2 are replaced by among sire variance components. We calculated the overall among sire variance component (VS) from MS of a two-way anova by:
where MSS is the sire main effect MS, MSD(S) is the dam nested within sire MS, MSS*E is the sire by environment MS, MSD(S)*E is the interaction between dam nested within sire and the environment MS, n is the number of offspring per dam per sire, and E is the number of environments. The environment specific among sire variance components (VS,E1 and VS,E2) are the variances of the sire effects from one-way anovas based on data from each environment, calculated as:
where MSS,Ei is the sire MS, MSD(S),Ei is the dam within sire MS, nEi is the mean family size per dam in the ith environment and D is the number of dams per sire.
Method 3: variance components from two-way anova (Yamada, 1962)
We used the overall sire variance component (eqn 7) along with the sire-by-environment interaction component (VS*E) to estimate the genetic correlation as in eqn 5. The family variance components in eqn 5 were replaced by sire variance components. We calculated VS*E from a two-way anova as:
where MSS*E is as described in eqn 7, MSD(S)*E is the MS for the interaction between dam nested within sire and the environment, n is the mean family size per dam per environment and D is the number of dams per sire.
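For the half-sib design, the component calculations can be sketched from mean squares in the same spirit as for full-sibs. The forms below are assumptions inferred from the definitions above and from balanced nested anova expectations: VS,Ei = (MSS,Ei − MSD(S),Ei)/(nEi × D) and VS*E = (MSS*E − MSD(S)*E)/(n × D). The overall sire component VS is taken as given, and all names and values are invented.

```python
def sire_component_one_env(ms_sire_i, ms_dam_i, n_i, dams_per_sire):
    """Among-sire component in one environment.

    Assumed form: (MS_S,Ei - MS_D(S),Ei) / (n_Ei * D).
    """
    return (ms_sire_i - ms_dam_i) / (n_i * dams_per_sire)

def sire_by_env_component(ms_sxe, ms_dxe, n, dams_per_sire):
    """Sire-by-environment component.

    Assumed form: (MS_SxE - MS_D(S)xE) / (n * D).
    """
    return (ms_sxe - ms_dxe) / (n * dams_per_sire)

def rg_from_sire_components(v_s, v_sxe):
    """Assumed eqn 5 form with sire components: V_S / (V_S + V_SxE)."""
    denom = v_s + v_sxe
    return None if denom <= 0 else v_s / denom

# Invented mean squares: four offspring per dam, three dams per sire
v_s_e1 = sire_component_one_env(ms_sire_i=15.0, ms_dam_i=6.0, n_i=4, dams_per_sire=3)
v_sxe = sire_by_env_component(ms_sxe=6.0, ms_dxe=3.6, n=4, dams_per_sire=3)
r_half_sib = rg_from_sire_components(0.6, v_sxe)
```

Sire components estimate one quarter of the additive genetic variance, and that factor cancels in the ratio, so the correlation can be computed directly from the sire components.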
Method 4: variance components from separate REML estimates
This method uses sire variance components to calculate the genetic correlation in the same way as in eqn 2. We estimated all variance components from separate REML analyses. We obtained VS by analysis of data from both environments and environment specific variance components using REML estimates based on data from environments E1 and E2.
Method 5: variance components from a single REML analysis
This method uses eqn 5 to calculate the genetic correlation. VF and VF*E in eqn 5 are replaced with the appropriate sire variance components (VS and VS*E). We estimated these from a single REML analysis performed on data from both environments.
For both studies we performed all anovas and REML analyses using code written in S-Plus 6 Pro (Insightful, 2001). Variance components in the REML analyses were obtained using the VARCOMP function. We used SYSTAT 10 for Windows (SPSS, 2000) to perform all other analyses.
Comparison of published estimates of genetic correlations
We restricted our comparison to data from published studies reporting estimates of genetic correlation across environments based on both family means and a method that used variance components. Studies may have estimated variance components by LS anova or REML (details are included in Appendix 1). This allowed us to directly compare different estimates based on the same data. We searched for papers in ISI Web of Knowledge (Thomson ISI, Alexandria, VA, USA) using combinations of keywords (genetic correlation, across environment correlation). We also checked titles of papers from the last 10 years of the journals Evolution and Heredity. In two studies (Roff & Bradford, 2000; Kause & Morin, 2001) the family mean correlations used in our analyses were not published in the original article but were kindly provided by the authors. Finally, we added data from two studies conducted in our labs. For the ladybirds we estimated correlations for six traits in each sex for a total of 48 comparisons; however, we present detailed analysis for only one morphological and one life-history trait. For the cockroaches detailed results are presented for all of the estimates used.
We used Pearson product-moment correlations to compare estimates of cross-environment genetic correlation obtained by family means with all variance components methods. Because error variance exists for both estimation methods, model II (orthogonal) regression was used to estimate the slope of the line describing the relationship between correlation estimates (Sokal & Rohlf, 1995). The ratio of error variances used for the model II regression is proportional to the ratio of univariate variance estimates and produces a slope equivalent to the first principal component of the bivariate data. We arbitrarily chose the family mean estimate as the independent variable and therefore a slope greater than 1 means that estimates from variance component methods are greater than family mean estimates and vice versa. We further investigated the agreement between estimates for subsets of data consisting of estimates from full- and half-sib experimental designs. We also used Pearson correlation to test the relationship between mean family size and the absolute magnitude of the difference between correlation estimates.
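The comparison between estimators can be sketched as a major axis fit. This assumes the orthogonal (major axis) form of model II regression, whose slope is the direction of the first principal component of the covariance matrix of the two sets of estimates; if the variables were standardized first, the slope would differ. Function and variable names are illustrative, and the data are invented.

```python
from math import sqrt
from statistics import mean

def major_axis(x, y):
    """Major axis (orthogonal) regression of y on x.

    Returns (intercept, slope, pearson_r). x holds family mean estimates
    and y holds variance component estimates in this sketch.
    """
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if sxy == 0:
        raise ValueError("major axis slope is undefined when the covariance is zero")
    slope = (syy - sxx + sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)
    intercept = my - slope * mx
    r = sxy / sqrt(sxx * syy)
    return intercept, slope, r

# Invented estimates lying exactly on y = 0.1 + 2x
fm_estimates = [0.0, 1.0, 2.0, 3.0]
vc_estimates = [0.1, 2.1, 4.1, 6.1]
alpha, beta, r = major_axis(fm_estimates, vc_estimates)
```

With the family mean estimate on the x axis, a slope above 1 indicates that variance component estimates exceed family mean estimates, which is how the slopes reported in the Results are read.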
Results
Harmonia axyridis – heritabilities and relationship between genetic correlation estimates
We found highly significant family main effects from anova (Development, F20,171 = 22.115, P < 0.001; Pronotum, F19,167 = 7.829, P < 0.001). These family effects are reflected in the full-sib estimates of heritability for these two traits in both environments (Table 1a). Full-sib heritabilities for ladybirds were all high. Genetic correlation estimates of ladybird pronotum width by different methods ranged between 0.366 and 0.639 (Table 2a). For both pronotum width and development time, genetic correlations derived from family means gave the lowest correlation estimate.
Table 1. (a) Means and full-sib heritabilities for H. axyridis traits calculated from variance components obtained from REML and LS anova estimates. (b) Means and half-sib heritabilities for N. cinerea traits calculated from variance components obtained from REML and LS anova estimates.
(a) Pronotum (mm), pea; Pronotum (mm), bean; Development (days), pea; Development (days), bean.
(b) Pronotum (mm), 20 °C; Pronotum (mm), 27 °C; Development (days), 20 °C; Development (days), 27 °C.
REML, restricted error maximum likelihood; LS, least squares.
Table 2. Estimates of variance components and genetic correlations by different methods: (a) H. axyridis, (b) N. cinerea.
Genetic correlation estimates are: FM, Pearson correlation of family means (method 1); REMLSep, REML estimates from environment specific variance components (method 4); REMLCom, REML estimates from combined environment variance components (method 5); LSSep, least squares estimates from environment specific variance components (method 2); LSCom, least squares estimates from combined variance components (method 3); n, number of families (H. axyridis) or sires (N. cinerea).
Variance components for pronotum in both (a) and (b) have been multiplied by 100.
(a) H. axyridis: Pronotum, n = 20; Dev, n = 21.
(b) N. cinerea: Pronotum, n = 18; Dev, n = 21.
Nauphoeta cinerea – treatment effects and the relationship between genetic correlation estimates
We found that the sire main effects from anova were highly significant (Development, F17,394 = 16.200, P < 0.001; Pronotum, F17,394 = 2.568, P < 0.01). These sire effects are reflected in the half-sib estimates of heritability for these two traits in both environments (Table 1b). Half-sib heritabilities for cockroaches were all moderately high or high. Genetic correlation estimates for pronotum size across environments in cockroaches ranged from −0.059 to 0.263 (Table 2b). Family mean correlations were notably lower than the estimates from all other methods.
Published studies – relationship between genetic correlation estimates
Overall, we found that family mean and variance component estimates of genetic correlation were highly related to each other (Pearson correlation, n = 162, r = 0.77). The slope of the model II regression line for the whole data set showed that family mean correlation estimates were lower than variance component estimates (α = −0.037, β = 1.556, 95% CI, LL = 1.364, UL = 1.776, Fig. 1). We also found a good relationship between estimates when we considered subsets of data including only full- or half-sib studies (Pearson correlations: Full-sib, n = 95, r = 0.89; Half-sib, n = 67, r = 0.80). The slopes of lines describing these subsets showed that there is a small discrepancy between full-sib estimates, and a much larger discrepancy between half-sib estimates (full-sib α = 0.020, β = 1.241, 95% CI, LL = 1.113, UL = 1.384; half-sib α = −0.124, β = 2.113, 95% CI, LL = 1.744, UL = 2.560, Fig. 1a).
We also tested for bias due to the number of estimates from each study by weighting estimates from each study by the inverse of the number of estimates from that study. The slope of the weighted model-II regression did not differ significantly from the unweighted regression and was still significantly greater than 1 (α = 0.038, β = 1.420, 95% CI, LL = 1.314, UL = 1.534).
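The weighting scheme can be illustrated with a short sketch. It assumes that the weights enter the fit through weighted means, sums of squares and cross-products, which is one plausible way to weight a major axis regression and not necessarily the procedure used for the analysis above. Study labels and values are invented.

```python
from collections import Counter
from math import sqrt

def inverse_count_weights(study_ids):
    """Weight each estimate by 1 / (number of estimates from its study)."""
    counts = Counter(study_ids)
    return [1.0 / counts[s] for s in study_ids]

def weighted_major_axis_slope(x, y, w):
    """Major axis slope from weighted sums of squares and cross-products."""
    total = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / total
    my = sum(wi * yi for wi, yi in zip(w, y)) / total
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    syy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    return (syy - sxx + sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)

# Invented example: study A contributes three estimates, study B one
studies = ["A", "A", "A", "B"]
fm = [0.1, 0.2, 0.3, 0.6]
vc = [0.15, 0.30, 0.45, 0.90]
weights = inverse_count_weights(studies)
slope = weighted_major_axis_slope(fm, vc, weights)
```

With these weights every study contributes a total weight of 1, so a study reporting dozens of estimates counts no more than a study reporting one.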
Due to the large number of comparisons obtained from two studies (Via & Conner, 1995; our studies) we checked that our findings were robust when data from these studies were removed. The relationship between correlation estimates reported in Via & Conner (1995) was not as strong as the relationship found between estimates from remaining studies although correlations were significant for both subsets (Pearson correlation: Via & Conner, n = 54, r = 0.79, χ2 = 50.874, P < 0.001; Others, n = 108, r = 0.86, χ2 = 140.442, P < 0.001). Similarly the relationship between correlation estimates from one of our data sets (ladybirds) was not as strong as the relationship found between estimates from other studies. Again correlations in both subsets were highly significant (Pearson correlation: ladybirds, n = 48, r = 0.58, χ2 = 18.786, P < 0.001; others, n = 114, r = 0.82, χ2 = 126.687, P < 0.001).
We found a weak, nonsignificant negative relationship between family size and the difference between correlation estimates (Pearson correlation, n = 162, r = −0.121, χ2 = 2.352, P = 0.125, Fig. 2).
Discussion
Some authors have advised caution against any use of family mean correlation because simulations show that they substantially underestimate the real genetic correlation (Rausher, 1984; Via, 1984; Fry, 1992; Carriere et al., 1994; Windig, 1997). Our results are consistent with this view and show that genetic correlations estimated from family mean data are lower than those obtained by other methods. However, inspection of the relationship between estimates of family mean correlations and those derived from calculating variance components indicates that there is close agreement when the genetic correlations are low. With medium to large genetic correlations the agreement weakens, a reflection of the large discrepancies that sometimes occur when the correlation estimates are even moderately greater than 0. Note that these are comparisons between different types of estimation and not comparisons to true values (sensu Windig, 1997).
The disagreement between family mean estimates and variance component estimates is largely due to the difference between half-sib variance component estimates and family mean estimates. For this reason we suggest that when using an experimental design that produces half-sib genetic data the genetic correlation should be calculated using variance components derived from REML or anova. Of course, half-sib and other more complex breeding designs are always preferable whenever there is a possibility of maternal effects, common environmental effects, and dominance variance. Thus half-sib designs are most likely to be used when accuracy of estimation is important and the use of REML and anova in these cases is likely to be especially important.
Our results also show that there is slightly greater agreement between different correlation estimates when mean family size increases. This is probably due to decreases in error variance in both family mean and all variance component estimation methods that are presumed to occur with increased family size (Via, 1984; Shaw, 1987; Roff & Preziosi, 1994; Roff, 1997). However, as error variance is expected to lower the estimated genetic correlation in all methods, the convergence suggests that the increase in mean family size in these studies has a greater effect on reducing error in the family mean estimates.
Ease of calculation is not a sufficient criterion for choosing a method. Whenever the aim of an investigator is to accurately estimate either genetic variance components or the genetic correlation as a general parameter, anova or REML analysis should be used. In most cases researchers are interested in testing if correlations differ from 0 or 1 and our estimates are expected to be normally distributed around the true population parameter. There are two biases that should be noted when variance component approaches are used for these tests. First, in REML approaches the variance components are often constrained to be greater than 0. This constraint will bias genetic correlations upward and has been previously noted in the literature (e.g. Shaw, 1987; Windig, 1997). The second bias results from REML estimation methods that estimate variance separately in each environment. Under this method the estimates are constrained to be no greater than 1, which results in a downward bias in the genetic correlation. This pattern is evident in the data we collected from the literature (Fig. 1b).
Estimates of genetic correlations across environments from family mean correlations underestimate the real genetic correlation. As we show here, anova and REML estimates are consistently higher. Estimates of the relative strength of a correlation, or any study wishing to provide a parameter estimate, require a more sophisticated method than family mean correlation.
Despite these caveats, researchers are often interested in providing some general insights into genetic differences expressed in two environments, or have data that might be used to calculate genetic correlations from family means but are too limited for a full quantitative genetic approach. Field biologists and others who may not have the time or resources for the most appropriate quantitative genetic design may nevertheless be interested in questions best addressed with quantitative genetic data. For studies that simply wish to establish a genetic basis for changes in phenotype measured in two or more environments, the investigator requires a test of whether the genetic correlation is greater than 0. For this purpose we suggest that family mean correlation is an acceptable method and will provide a conservative estimate due to its tendency to produce lower genetic correlation estimates than other methods. However, when the aim of a study is to identify genetic correlations less than 1, the converse is true: because family mean correlations tend to be lower than other estimates, a test based on them will too readily conclude that the correlation is less than 1. This is important when the investigator wishes to ask questions about the degree of independence between different parts of a reaction norm (Via & Lande, 1987).
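A test of whether a family mean correlation is greater than 0 might be sketched as below. This uses Fisher's z transformation with a normal approximation, which is a common textbook approach and an assumption of this sketch; it is not the χ2 test reported with the correlations above. The function name and the example values are hypothetical.

```python
from math import atanh, sqrt
from statistics import NormalDist

def p_correlation_greater_than_zero(r, n_families):
    """One-sided P-value for H0: rho <= 0 using Fisher's z approximation.

    r: family mean correlation; n_families: number of families (or sires)
    contributing a pair of means. Requires -1 < r < 1 and more than
    three families.
    """
    if n_families <= 3:
        raise ValueError("need more than three families")
    z = atanh(r) * sqrt(n_families - 3)  # approximately standard normal under H0
    return 1.0 - NormalDist().cdf(z)

# Invented example: r = 0.6 estimated from 20 family means
p_value = p_correlation_greater_than_zero(0.6, 20)
```

Because family mean correlations tend to be lower than variance component estimates, a significant result from a test of this kind errs on the side of caution.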
We thank L. Cook, J. Wolf, D. Roff, B. Walsh and an anonymous reviewer for comments that improved earlier versions of this manuscript. We also thank D. Roff and A. Kause for providing unpublished data. This work was supported by a Biotechnology and Biological Sciences Research Council (BBSRC) postgraduate fellowship to PAA.
Table 3. Appendix 1 Papers reporting genetic correlations across environments included in this study.
Number of points
Mean family size
Trait type: M, morphology; L, life history; correlation type: FS, full-sibs; HS, half-sibs; C, clones; SS, self-sibs; variance estimation: REML, restricted error maximum likelihood; LS, least squares.
Number of points refers to the number of correlations reported, number in brackets is the number of pair-wise estimates included in analyses.
Mean family size, the number of individuals per family per treatment.