The use of marker-based relationship information to estimate the heritability of body weight in a natural population: a cautionary tale
Stuart C. Thomas, Institute of Cell, Animal and Population Biology, University of Edinburgh, West Mains Road, Edinburgh, EH9 3JT, UK. Tel.: +44 131 650 5440; fax: +44 131 650 6564; e-mail: email@example.com
A number of procedures have been developed that allow the genetic parameters of natural populations to be estimated using relationship information inferred from marker data rather than known pedigrees. Three published approaches are available: the regression, pair-wise likelihood and Markov Chain Monte Carlo (MCMC) sib-ship reconstruction methods. These were applied to body weight and molecular data collected from the Soay sheep population of St. Kilda, which has a previously determined pedigree. The regression and pair-wise likelihood approaches do not specify an exact pedigree and yielded unreliable heritability estimates that were sensitive to alteration of the fixed effects. The MCMC method, which specifies a pedigree prior to heritability estimation, yielded results closer to those determined using the known pedigree. In populations of low average relationship, such as the Soay sheep population, determination of a reliable pedigree is more useful than indirect approaches that do not specify a pedigree.
In recent years there has been increasing interest in estimating genetic variance components in natural populations, with heritabilities being estimated in hundreds of studies (see meta-analyses by Mousseau & Roff, 1987; Roff & Mousseau, 1987; Weigensberg & Roff, 1996). Accurate estimates are important in the understanding of patterns of short-term evolution, the reconstruction of historical patterns of natural selection (Lande, 1979) and the prediction of genetic responses to selection. In addition they allow inference to be made about the underlying causes of clinal variation, through the comparison of the variance components describing the same traits within subpopulations of the same species (Coyne & Beecham, 1987).
Variance component estimates can provide information on the number of individuals required in order to maintain a viable population, and so are useful for the management of captive populations (Storfer, 1996). The loss of genetic variation is a restricting factor in a species' ability to respond to natural selection, and hence a limitation on its potential to evolve (Lande, 1982; Mousseau & Roff, 1987; Falconer & Mackay, 1996; Lande & Shannon, 1996). Variation is therefore critical for maintenance of species within a changing environment.
Whether variance components are sought for evolutionary insight or conservation biology, standard estimation methods such as regression of offspring phenotypes against parental phenotypes, sib-ship analyses or restricted maximum likelihood (REML) techniques under the animal model (see Lynch & Walsh, 1998) are often difficult or impossible to follow in the wild because they require known pedigrees. However, by typing individuals at marker loci, information may be inferred about relationships (Thompson, 1975; Queller & Goodnight, 1989; Lynch & Ritland, 1999). Using this information, a number of indirect methods have recently been developed (described below) that allow variance component estimation with limited pedigree information. Unfortunately, the inherent inaccuracy of such indirect approaches may restrict their use in practice, where accurate estimates are required in order to avoid erroneous conclusions about the underlying population parameters.
Two techniques have been introduced that do not require exact pedigrees to be specified: a regression approach (Ritland, 1996; Lynch & Walsh, 1998) and a likelihood approach (Mousseau et al., 1998; Thomas et al., 2000). The main advantage of these pedigree-free approaches is that noise in the inferred relationship data may be accounted for in the analysis.
The regression approach includes relationship information in the form of estimates of pair-wise relatedness. It uses a between and within locus ANOVA to remove the sampling error variance of relatedness estimation within pairs from the total variance of relatedness. The ANOVA therefore provides a `noise-free' estimate of the actual variance of the relationships within the population for use in subsequent variance component analysis (Ritland, 1996). The likelihood approach also works on pairs, and accounts for the uncertainty of the relationship data by attaching a likelihood to each of a number of relationship classes into which the pair might be assigned (Mousseau et al., 1998). However, the likelihood approach requires that the relative size of each of the relationship classes considered in the analysis is known prior to study. Its application is therefore limited to populations where such information is available.
The regression approach has been used previously to determine heritabilities of traits in a wild plant population, Mimulus guttatus (Ritland & Ritland, 1996). Resulting estimates were larger than those determined under more controlled conditions. This result is contrary to expectation because, under controlled conditions, environmental variance might be expected to be lower (Coyne & Beecham, 1987; Ritland & Ritland, 1996) although meta-analysis of studies fails to support this idea (Weigensberg & Roff, 1996). However the result may also be a reflection of the large sampling variance associated with this approach (Thomas et al., 2000). The likelihood technique was applied to a captive salmon population (Oncorhynchus tshawytscha), resulting in estimated heritabilities that were similar to previously derived estimates (Mousseau et al., 1998). However, the salmon population was set up under rather specific conditions so that a full-sib population structure, with known prior probabilities, could be assumed.
Alternatively, in a third approach, marker information can be used to infer exact relationships, thereby reconstructing a pedigree suitable for use in traditional variance component analysis, e.g. REML techniques (Patterson & Thompson, 1971; Lynch & Walsh, 1998). The Markov-chain Monte Carlo (MCMC) approach (Thomas & Hill, 2000) is based upon relationship assignment. First, a likely set of sib-ships is reconstructed, and then, under the assumption that the pedigree is correct, REML techniques are used to estimate variance components. The MCMC approach allows well-established methods of variance component estimation to be used, e.g. REML, thus family specific and relationship-specific information is weighted more efficiently than in pair-wise analysis (Thomas & Hill, 2000). However, incorrectly assigned relationships can lead to large bias in variance component estimation (Thomas & Hill, 2000).
For clarification, the MCMC approach to sib-ship reconstruction is also based on likelihood techniques, but it will be referred to here as the MCMC approach and the likelihood-based pair-wise approach as the likelihood approach. The adaptable nature of both likelihood-based approaches allows any information derived from known relationships to be included in the analysis (Thomas & Hill, 2000).
In a recent study, Milner et al. (2000) used a pedigree, determined through field observation of mother-offspring pairs combined with paternity inference using genetic markers, to estimate the heritabilities of several traits in an unmanaged population of Soay sheep (Ovis aries). Paternities were inferred using CERVUS 1.0 (Marshall et al., 1998), which attaches confidence values to an assigned paternity. Paternities achieving an average confidence of 95% were used in variance component analysis (Milner et al., 2000). Variance components were estimated for both males and females using REML methodology with the data analysed under an `animal' model (Lynch & Walsh, 1998). When the pedigree was based upon paternities assigned with 80% rather than 95% confidence, heritability estimates for body weight were about 50% lower in males and about 30% lower in females (Milner, 1999). This observed reduction, although not statistically significant, might have been because of the bias introduced through inaccurate relationship information.
In this study a Soay sheep data set is analysed using the marker-based systems of variance component estimation. The exact data set used is a modified form of the set used by Milner et al. (2000), and comprised animals born between 1995 and 1999 (inclusive), whose maternal identity was known from observation at lambing. Body weight is used as an example trait and an attempt is made to address the question `which of the approaches produces a good (reliable) estimate of the heritability?' rather than addressing the question `how heritable is body weight?' The promiscuity of both sexes makes the study population one of the most problematic with regard to paternity inference (Pemberton et al., 1999), and it is of interest to see how heritability estimates derived from the newer marker-based approaches compare. Estimates of the heritability are made using all the pedigree-free approaches, and comparison is made with approaches that do specify a pedigree. Alternative approaches to pedigree reconstruction are examined.
Materials and methods
The Soay sheep population
Soay sheep were introduced to Hirta, the largest island (638 ha) in the St Kilda group (57°49′N, 8°34′W), from a neighbouring island in 1932, and since that time have remained unmanaged. Intensive study of the sheep is restricted to a 170-ha study area that contains approximately one-third of the island's population. Since 1985 over 95% of the animals born within the study area during the April–May lambing have been tagged and sampled for genetic analysis soon after birth (Clutton-Brock et al., 1991, 1992). Behavioural observation of lambs during this period allows the identity of the mother, if tagged, to be established. Genotype information for all putative mother-offspring pairs was consistent with the inferred relationship (Pemberton et al., 1999). Each August, over one-half of the population in the study area is caught, allowing body weight measurements to be taken. Further animals are caught during the November rut, when a number of rams immigrate into the study area. As many of these immigrants as possible are tranquillized, tagged and sampled for subsequent paternity analysis.
The study described here used a subset of the Soay sheep data comprising 529 animals, born after 1994, with 759 body weight measurements. The data set used was a modified form of the one used by Milner et al. (2000), and contained animals caught between 1995 and 1999 inclusive. Animals born before 1995 were excluded, as these had been typed using a different set of markers. In addition only animals with known mothers were included in the sample, regardless of whether the mother's genotype was known. Sampled animals were genotyped at 12 marker loci (Table 1). Across all animals only about 2% of genotype data was missing. A pedigree determined using the likelihood-based paternity inference package CERVUS 1.0 (Marshall et al., 1998), with confidence levels set to 95%, was available for the data set (Pemberton et al., 1999). Details of the paternity inference are available from Pemberton et al. (1999) and Coltman et al. (1999).
The phenotypic, molecular and relationship data were used to estimate a number of fixed and random effects. The fixed effects modelled were sex, age, twin status, year of measurement and day of measurement. The small number of fixed effects included was chosen to reflect those that might be known in a newly studied population. Age and sex were always fitted into the phenotypic model, but for each model only two out of twin status, year of measurement, and day of measurement were added. Age was fitted as an interaction with sex, and was modelled as either a polynomial or as a categorical variable. When fitted as a categorical variable, age had three classes: lamb, yearling and adult. Models where age was fitted as a categorical variable were included to mimic situations where exact ages are unknown, but where some information about age can be inferred from the phenotype of other traits (e.g. teeth).
The phenotypic variance was partitioned into three components: the additive genetic variance (VA), partitioned using pedigree data, derived from known relationship data or marker-based methodologies; the specific environmental variance of an individual record (VEs), partitioned using repeated measurements on the same individual; and the general environmental variance (VEg) common to all records of an animal (Falconer & Mackay, 1996). Maternal effects (or the environmental covariance of full-sibs) were not fitted, because this led to problems of parameter convergence (see discussion). Heritability estimates were used as a summary statistic for the variance components, and were calculated as the ratio of the additive genetic variance to the total phenotypic variance:
$$h^2 = \frac{V_A}{V_A + V_{Eg} + V_{Es}}$$
Analyses using the regression and pair-wise likelihood techniques involved two steps. Firstly, in order to obtain a single phenotypic measure for each animal in the sample, the REML-based statistics package ASREML (Gilmour et al., 1997) was used, but with no relationship information included in the analysis. The fixed effects and VEs were therefore estimated prior to analysis using relationship data inferred from marker genotypes. Residual deviations for each animal, equal to the average of the phenotypic values of the repeat readings on the animal after the fixed effects had been removed, were computed in ASREML. The variance among individual deviations is the sum of three components, VA, VEg and VEs/n, where n is the number of repeat records for that individual. When averaged over animals, the coefficient of VEs is the reciprocal of the harmonic mean of the number of records for each individual.
Secondly, the variance of the residual deviations was then partitioned using the marker information within the regression and pair-wise likelihood frameworks, thereby obtaining an estimate for VA. VEg was estimated from the total variance of the residual errors and the estimates of VA and VEs.
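As a hedged illustration of this second step, the sketch below backs out VEg from the variance of the individual deviations, assuming the decomposition Var(deviation) = VA + VEg + VEs/n_h, with n_h the harmonic mean of the number of records per animal. The function and argument names are hypothetical and are not taken from ASREML or from the original analysis.

```python
from statistics import harmonic_mean, variance


def general_env_variance(deviations, n_records, v_a, v_es):
    """Estimate VEg from Var(dev) = VA + VEg + VEs / n_h.

    deviations : per-animal mean residual after removing fixed effects
    n_records  : number of repeat records for each animal
    v_a, v_es  : estimates of VA and VEs obtained elsewhere
    All names are illustrative assumptions, not the authors' code.
    """
    n_h = harmonic_mean(n_records)       # harmonic mean of record counts
    total = variance(deviations)         # sample variance of deviations
    return total - v_a - v_es / n_h


# Toy usage with made-up numbers
example = general_env_variance([-1.0, 0.0, 1.0], [1, 2, 2], v_a=0.3, v_es=0.3)
```

Because VEg is obtained by subtraction, any error in the estimate of VA carries straight through to it, and nothing in this sketch prevents the result from being negative.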
Lynch & Ritland's (1999) estimator of pair-wise relatedness was used to determine relationship information for inclusion in the regression-based estimation procedure (Ritland, 1996). Simulation studies indicated that the use of this estimator in the regression framework yielded heritability estimates with the lowest mean squared errors compared with the use of other estimators of relatedness (Thomas, 2001).
The likelihood approach (Mousseau et al., 1998) used the phenotypic function based on the difference between the phenotype of animals within a pair (Thomas et al., 2000). Four classes of relationship were assumed present for the likelihood approach (Table 2). Two sets of prior information were used to describe the distribution of the relationships within the sample: a `flat' set, where 99% of pairs were assumed unrelated, but where the remaining categories were assumed equally frequent, and an `exact' set, calculated from the 95% pedigree. Analysis was then repeated using the pair-wise likelihood approach, but with maternal data assumed known. When a relationship was known exactly, the probability of that relationship was set to one and that of other relationships set to zero. The likelihood of subsequent relationships for the offspring were updated to incorporate the extra information obtained through the knowledge of the source of one of its alleles. For example, if a mother-offspring pair was known, the parent-offspring relationship category was set to zero for all other comparisons between the offspring and an older female.
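The handling of priors and known relationships described above can be sketched as follows. This is a minimal illustration only: the four class labels are assumed (the text specifies only that four classes were used), the normalisation of the weights is a presentation choice, and the marker likelihoods are taken as given rather than computed from genotypes.

```python
# Assumed class labels; the text states only that four classes were used.
CLASSES = ("unrelated", "half_sib", "full_sib", "parent_offspring")


def flat_priors(p_unrelated=0.99):
    """'Flat' priors: 99% of pairs unrelated, the rest split equally."""
    rest = (1.0 - p_unrelated) / (len(CLASSES) - 1)
    return {c: (p_unrelated if c == "unrelated" else rest) for c in CLASSES}


def class_weights(marker_likelihoods, priors, known=None):
    """Weight each relationship class for one pair of animals.

    marker_likelihoods : dict of class -> likelihood of the pair's genotypes
    priors             : dict of class -> prior frequency of that class
    known              : if the relationship is known exactly, its class
                         gets probability one and the others zero
    """
    if known is not None:
        return {c: 1.0 if c == known else 0.0 for c in CLASSES}
    weighted = {c: marker_likelihoods[c] * priors[c] for c in CLASSES}
    total = sum(weighted.values())
    return {c: w / total for c, w in weighted.items()}


priors = flat_priors()
uninformative = class_weights({c: 1.0 for c in CLASSES}, priors)
mother_pair = class_weights({c: 1.0 for c in CLASSES}, priors,
                            known="parent_offspring")
```

With uninformative markers the weights simply reproduce the priors, which shows how strongly an inaccurate prior can dominate when marker information is limited.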
Table 2. Summary of the relationship classes used in the likelihood technique, showing the likelihoods for the seven genotype patterns, the distribution of phenotypic difference, the number of each relationship as estimated in the 95% confidence pedigree and the two sets of priors. Total number of animals = 529. Alleles are indexed i and are mutually exclusive; p_i is the allele frequency of i, and m is the harmonic mean of the number of phenotypic observations on each of the pair of animals.
Analysis with the regression and pair-wise likelihood approaches was then repeated, but using the `known' relationships from the 95% confidence pedigree, with the relationship classes restricted to those of the likelihood approach, because these represented the most common relationships within the sample (Table 2).
Standard errors of the variance components were estimated by bootstrapping (Efron & Tibshirani, 1993). Bootstrapping could be operated at two levels, either at the level of individuals before pair-wise analysis, or at the level of pairs. Simulations using known relationships indicated that resampling pairs greatly underestimates standard errors, whereas resampling individuals (with pairs made up of the same individual sampled twice excluded) tends to overestimate them by between 10 and 100% (unpublished result). Standard errors were therefore calculated by resampling individuals; they are conservative but cannot be regarded as reliable. Bootstrap estimates of the standard error of the additive genetic variance were combined with the large sample standard deviation of the total phenotypic variance (estimated by ASREML) to estimate the standard error of the heritability, using the approximation:
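As an illustration of the individual-level resampling described in this paragraph (and not of the approximation itself), the following minimal sketch resamples individuals with replacement, forms all pairs, drops pairs made up of the same individual sampled twice, and recomputes a pair-wise statistic. The `pair_statistic` callable is a hypothetical stand-in for whichever pair-wise estimator is being bootstrapped.

```python
import random
from itertools import combinations
from statistics import stdev


def bootstrap_se(ids, pair_statistic, n_boot=200, seed=0):
    """Bootstrap standard error by resampling individuals, not pairs.

    ids            : list of individual identifiers
    pair_statistic : hypothetical callable taking a list of (id, id) pairs
                     and returning a single estimate
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        # Resample individuals with replacement
        sample = [rng.choice(ids) for _ in ids]
        # Form pairs, excluding an individual paired with itself
        pairs = [(a, b) for a, b in combinations(sample, 2) if a != b]
        estimates.append(pair_statistic(pairs))
    return stdev(estimates)


def mean_abs_diff(pairs):
    """Toy statistic for demonstration: mean absolute difference of ids."""
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


se = bootstrap_se(list(range(10)), mean_abs_diff, n_boot=50, seed=1)
```

Resampling at the level of individuals keeps all pairs involving a resampled animal together, which is why it behaves differently from resampling pairs directly.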
Analyses using pedigrees. REML techniques (Patterson & Thompson, 1971) were used to estimate variance components using an `animal' model (Lynch & Walsh, 1998). This analysis was similar to that performed by Milner et al. (2000), although here only a single trait, body weight (live weight at catch in August), was examined and the sexes were analysed simultaneously. In addition, the number of fixed effects fitted was reduced and was varied to examine the influence of model on marker-based analysis. All the effects were estimated simultaneously using ASREML (Gilmour et al., 1997). This analysis was identical to the first step of the `pedigree-free' analyses but included the pedigree information, and thus also estimated VA and VEs.
Five pedigrees (labelled I–V) derived using some form of pedigree reconstruction were analysed within the REML framework:
(I) A pedigree based upon reconstructed half sib-ships only, not using known mother-offspring relationships. Half-sib families were reconstructed using the MCMC approach of Thomas & Hill (2000), under the assumption that each individual was a member of only one half-sib family and that there were no parent-offspring pairs in the sample. This was not a realistic model of the pedigree within the sample, but was included for completeness.
(II) A pedigree derived from only the known mother-offspring links and no inferred relationships. This pedigree therefore contained only mother-offspring and maternal half-sib relationships, and provided a control for the comparison of the estimates based on inferred relationships.
(III) A pedigree based upon sib-ships reconstructed using MCMC (Thomas & Hill, 2000) when known maternal-offspring data was incorporated into the analysis.
(IV) The 95% confidence pedigree (Table 2), based upon known mother-offspring pairs and paternity inference with CERVUS 1.0 (Marshall et al., 1998).
(V) The 95% confidence pedigree, but using sib-ship reconstruction on those animals not assigned a sire and thereby attempting to regain sib-ship information lost because the sire had not been genotyped.
For all the MCMC reconstructions, uninformative distributions, where any family size was equally likely, were used to describe sib-ship size.
Results
Estimates of heritability obtained using only inferred relationship data and either the regression or likelihood (with `exact' priors) approaches were unreliable and were sensitive to the choice of fixed effects fitted (Table 3). Calculation of the actual variance of the relationship from the 95% confidence pedigree showed that, given the sample size, the variance of the relationships was low (≈0.0005), reflecting a low number of related pairs. The population structure of the Soay sheep therefore makes analysis using just 12 marker loci very unreliable.
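To make concrete why a small number of related pairs gives a low variance of relationship, the sketch below computes that variance over all pairs from counts of related pairs in each class. The pair counts and relationship coefficients in the usage lines are hypothetical placeholders, not the values from the 95% confidence pedigree, and the coefficient scale is an assumption.

```python
def relationship_variance(n_animals, related_pair_counts):
    """Variance of pair-wise relationship across all pairs in a sample.

    n_animals           : number of sampled individuals
    related_pair_counts : dict of relationship coefficient -> number of
                          pairs with that coefficient; every other pair
                          is treated as unrelated (coefficient zero)
    """
    n_pairs = n_animals * (n_animals - 1) // 2
    mean_r = sum(r * n for r, n in related_pair_counts.items()) / n_pairs
    mean_r2 = sum(r * r * n for r, n in related_pair_counts.items()) / n_pairs
    return mean_r2 - mean_r ** 2


# Hypothetical counts, for illustration only
counts = {0.5: 600, 0.25: 900}
var_sample = relationship_variance(529, counts)
# Same related pairs embedded in a larger sample of mostly unrelated animals
var_larger = relationship_variance(1000, counts)
```

With 529 animals there are 139,656 distinct pairs, so even several hundred related pairs make up a small fraction of the total, and adding further unrelated animals dilutes the variance still more.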
Table 3. Summary of the heritability estimates and standard errors (bracketed) obtained using the different estimators and under the different models.
Likelihood analyses that used `flat' priors are excluded from Table 3 because the use of flat priors resulted in all estimates of the additive genetic variance components being negative and thus fixed at the zero boundary of the parameter space. Several nonzero estimates were obtained when known mothers were incorporated into the flat prior analysis but these were very close to zero, because of the incorrect prior information. In situations where estimates of the additive genetic variance were fixed at zero, estimates of the genetic variance obtained from bootstrap samples also tended to be fixed. Meaningful standard errors could not therefore be found in those situations.
Low levels of marker information and low relatedness may be partly compensated for through the inclusion of known data into the analysis. When maternal data or the 95% confidence pedigree were used in the likelihood analysis, the estimates approached the REML estimates obtained using the 95% confidence pedigree (Pedigree IV), although they generally had larger standard errors (Table 3). Estimates of the heritability obtained using the 95% confidence pedigree information ranged from about 0.2 to 0.4 regardless of the method of analysis, with only small deviation when different fixed effects were fitted (Table 3). Differences in the heritability estimates made using the pedigree-free and pedigree-determined approaches, when using the 95% confidence pedigree, reflect differences in the weighting of family and relationship information between the techniques. Alternatively they might reflect lower efficiency in the two-step estimation procedure of the fixed and random effects for the pedigree-free analysis compared with the one-step estimation for the pedigree-determined approaches. Likewise, sample error differences for the heritability estimates made from the 95% confidence pedigree reflect either the greater efficiency of REML techniques, or the poor estimation of the sampling errors obtained using bootstrap methodology.
Estimates made using REML techniques and the different pedigrees with assigned relationships indicated that the greater the number of assigned relationships included in the pedigree, the lower the estimate of the heritability. At one extreme, when only inferred relationships were analysed (i.e. Pedigree I – when only MCMC reconstruction of half-sibs was used) heritability estimates were either zero, or negligibly small (not shown in table). At the other extreme, with only known relationships included (Pedigree II – known mother-offspring links only), heritabilities were estimated as between 0.29 and 0.39 (Table 3). This pattern may be explained by downward bias introduced as a result of incorrectly assigned relationships. It may also be explained by the presence of a maternal effect, which would increase the similarity of sibs. Heritability estimates would therefore be biased upwards. The bias would be greatest when only mother-offspring relationships are used to form the pedigree, and would diminish as further (i.e. paternal) relationships are included in the analysis.
Visual comparison of the 95% pedigree (IV) and the MCMC approach with known mothers (III) indicated that a number of the same half sib-ships were recovered, although some were specific to the method of reconstruction. Further comparison of III and a pedigree determined using paternity inference set at 80% showed the same pattern, but with more sib-ships in common. Hence greater numbers of inferred relationships were present in the pedigree when information from the 95% confidence pedigree and sib-ship reconstruction was combined (V) than when information from each was analysed on its own. This helps explain why, when either pedigree III or IV was analysed, the estimates of heritability were intermediate between estimates made from II and V (Table 3).
Conclusions and discussion
The objective of this study was to assess methods that use relationship information inferred from marker data to estimate variance components on an actual population. Two approaches to make use of the marker information were examined, either to gain nonspecific relationship data or to specify exact relationships. The estimates of the heritability obtained from the 95% confidence pedigree using the pair-wise frameworks were regarded as the `best' achievable estimates using the pair-wise approaches. Deviations from these values when inferred relationship information was used were a result of the inaccuracies introduced through relationship inference.
The regression approach gave unreliable results, which deviated wildly when the fixed effects were changed. Low amounts of marker data and low numbers of relatives in the sample resulted in poor estimates of the actual variance of the relationship, which was greatly under-estimated (by a factor of 100). Estimation of the heritability using the regression approach requires division by the actual variance of the relationship and, because this variance was considerably less than one, small changes caused by alteration of the fixed effects were greatly amplified.
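The amplification can be illustrated with a short calculation. The sketch below assumes only the ratio form described above, a covariance between phenotypic similarity and estimated relatedness divided by the actual variance of relatedness; the `scale` argument is a placeholder for the constant that depends on how relatedness is defined, and all numbers are made up for illustration.

```python
def regression_h2(cov_similarity_relatedness, var_relatedness, scale=1.0):
    """Ratio-form heritability estimate (illustrative sketch only).

    cov_similarity_relatedness : covariance of pair-wise phenotypic
                                 similarity with estimated relatedness
    var_relatedness            : actual variance of relatedness
    scale                      : assumed placeholder constant
    """
    return cov_similarity_relatedness / (scale * var_relatedness)


# A shift of 0.0001 in the covariance, e.g. from changing the fixed effects,
# is divided by a variance of about 0.0005 and so moves the estimate a lot.
before = regression_h2(0.0002, 0.0005)
after = regression_h2(0.0003, 0.0005)
shift = after - before
```

Any change in the numerator is multiplied by the reciprocal of the variance, so a variance of order 0.0005 magnifies small perturbations roughly two-thousand-fold, and an under-estimated variance magnifies them further.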
The likelihood approach gave negative estimates of the heritability and so estimates were fixed at the boundary of the parameter space, especially in the situation where the priors were inaccurate. Again this is because of insufficient amounts of marker data to gain useful relationship data, and low numbers of relatives in the sample upon which to partition the variance. The MCMC approach also failed for similar reasons. For these techniques to operate successfully in a natural situation, much greater numbers of relatives are required in the sample as well as a greater amount of marker information. Incorporation of known relationship information into the likelihood and the MCMC approaches allowed more reliable estimates of the variance to be determined.
The likelihood approach requires that population structure be known prior to study. Its application is therefore limited to situations where such information is available. Alternatively, prior probabilities may be inferred from existing knowledge, such as the average life-time reproductive success and the age structure of individuals in the study population. Most of the information on the genetic variance components is derived from close relatives (e.g. full-sib, half-sib and parent-offspring groups), and accurate prediction of the prior probabilities for these groups is important.
In this study, the environmental covariance of full-sibs was not fitted into the model as there are few full-sibs in the Soay sheep population (Table 2), although the likelihood approach and the approaches that assign relationships allow for its inclusion. When a pedigree was used that contained only the known mother–offspring links, estimates of the heritability were larger than when assigned relationships were included. This could be because of bias introduced through the inclusion of inferred relationships in the other pedigrees, or because of the existence of a maternal effect that would inflate the similarity between maternal sibs. Attempts to fit maternal effects into the model often resulted in REML analysis that failed to converge, presumably because of insufficient contrasts within the pedigree. Maternal effects were therefore excluded from the analysis. Milner et al. (2000) found that heritabilities tended to be lower when maternal effects were included, although in body weight the change was not significant.
A problem with all of these approaches is the calculation of the standard errors of variance component estimates. In this study, bootstrap methodology was used to estimate errors for the pair-wise approaches that did not specify a pedigree. In cases where a pedigree was specified, large sample estimates of the variance of the parameters (from ASREML) were used to calculate the standard errors. Neither of these approaches provided a reliable means of estimating the standard errors. In the case of estimates obtained using ASREML, no account is made of the inaccuracy of the pedigree and so estimates of sampling errors are likely to be underestimates. In the case of the bootstrap-derived estimates, simulation studies of balanced populations with known relationships indicated that the sampling errors were overestimated. Ideally the bootstrap would resample over independent data points, a condition clearly violated when resampling over pairs. The individuals within the sample are not independent either, because they share relationships, and so the conditions for the bootstrap are also violated when resampling over individuals. As a result, when the level of relatedness in the population increases, the accuracy of parameter estimation increases, but the accuracy of standard error estimation decreases.
In the Soay sheep population, sib-ships could be reconstructed via paternity inference. In many cases paternities could be assigned with high confidence, and so would probably lead to the most reliable estimates of variance components. In the absence of information on candidate fathers, MCMC reconstruction of half sibs using the known maternal information provides a means to recreate the lost sib-ships. Indeed, a number of the same sib-ships were reconstructed using the MCMC approach including maternal data, as were reconstructed through assignment of individuals to the same sire using CERVUS (Marshall et al., 1998) although some sib-ships determined were specific to each approach. Increasing the number of assigned relationships led to a decrease in the size of the estimated heritability, probably due to an increase in the number of misassigned relationships, an effect also noted by Milner et al. (2000). Therefore, only relationships assigned with a high degree of confidence should be included in the analysis. Confidence levels may be determined for relationship assignments using sib-ship reconstruction and paternity inference by simulation (Marshall et al., 1998; Thomas & Hill, 2000).
In testing the marker-based approaches, we examined individuals born from 1995 to 1999 inclusive, in order to mimic a field project where each new cohort is sampled and phenotyped. Consequently, there were fewer related pairs within the sample (see Table 2) than existed in the standing population at any one time (data not shown). Although a `standing population' sampling approach would yield more related pairs, the related pairs would be more distantly related on average, and the number of unrelated pairs would increase at a much faster rate; paradoxically therefore the variance in relationship would decrease. We would not expect a `standing population' sampling approach to yield improved estimates over the `cohort sampling' approach described in this paper, especially in populations with small family sizes.
In summary, it is clear from previous simulation studies (Thomas et al., 2000) that for the marker-based approaches to be reliable, the relationship structure of the population is important. There is a basic need for large families or, equivalently, a large variance of relationship in the sample. In addition, these methods require a considerable amount of polymorphic marker data be typed per individual before estimated heritabilities become reliable, unless known relationships are included in the analysis (e.g. maternal information, Thomas & Hill, 2000). In consequence, the techniques are not appropriate for use on all natural populations of interest, and indeed misleading results may be obtained in populations that violate the above requirements. Considerable caution must therefore be exercised before field studies are undertaken, with simulation being a useful tool in deciding the merit of the marker-based techniques. Finally, given the comparative unreliability of marker-based estimates compared with estimates based on known relationships, the determination of a reliable pedigree is of more use in populations of low average relationship than the more indirect approaches.
The authors would like to thank the National Trust for Scotland and Scottish Natural Heritage for permission to work on St Kilda and for their assistance in many aspects of the work. Logistical support was provided by the Royal Artillery Range, Hebrides, its St Kilda Detachment and the Royal Corps of Transport. We also thank Jill Pilkington and the many volunteers who collected much of the long-term data. The Soay Sheep project has been funded by grants from BBSRC, NERC, the Royal Society and the Wellcome Trust. Stuart Thomas was funded by a BBSRC PhD studentship.