Heterozygosity fitness correlations (HFCs) have frequently been used to detect inbreeding depression, under the assumption that genome-wide heterozygosity is a good proxy for inbreeding. However, meta-analyses of the association between fitness measures and individual heterozygosity have shown that often either no correlations are observed or the effect sizes are small. One of the reasons for this may be the absence of variance in inbreeding, a requisite for generating general-effect HFCs. Recent work has highlighted identity disequilibrium (ID) as a measure that may capture variance in the level of inbreeding within a population; however, no thorough assessment of ID in natural populations has been conducted. In this meta-analysis, we assess the magnitude of ID (as measured by the g2 statistic) from 50 previously published HFC studies and its relationship to the observed effect sizes of those studies. We then assess how much power the studies had to detect general-effect HFCs, and the number of markers that would have been needed to generate a high expected correlation (r2 = 0.9) between observed heterozygosity and inbreeding. Across the majority of studies, g2 values were not significantly different than zero. Despite this, we found that the magnitude of g2 was associated with the average effect sizes observed in a population, even when point estimates were nonsignificant. These low values of g2 translated into low expected correlations between heterozygosity and inbreeding and suggest that many more markers than typically used are needed to robustly detect HFCs.
Heterozygosity fitness correlations (HFCs) have become a prevalent tool in conservation genetics and evolutionary biology (Coltman & Slate 2003; Balloux et al. 2004; Chapman et al. 2009). In these analyses, individual heterozygosity (as averaged over a number of loci) is often used as a proxy for inbreeding (Santure et al. 2010; Townsend & Jamieson 2013) and when associated with measures of fitness (such as survival or reproductive success) may reveal evidence of inbreeding depression. Heterozygosity fitness correlations can be especially useful in situations where a direct measure of inbreeding (such as from a pedigree) is not available, as is often the case for wild and endangered species (Grueber et al. 2008; Ruiz-Lopez et al. 2012; Klauke et al. 2013).
Three processes are thought to potentially underlie HFCs (Hansson & Westerberg 2002). ‘Direct-effects’ result when markers themselves have functional consequences and are directly linked to differences in fitness. ‘Local-effects’ occur when the genotyped markers themselves may not have direct consequences on fitness, but rather are in linkage disequilibrium with variants that do. Finally, ‘general-effect’ HFCs are attributed to an intrinsic benefit of being heterozygous, with the heterozygosity of the typed markers correlated with overall heterozygosity in the genome. The chances of detecting direct- or local-effect HFCs are relatively small, unless a substantial number of loci are considered, and thus, most HFC studies to date have focused on examining general-effect HFCs.
Previous reviews of HFC studies, however, have highlighted that many either do not find correlations, or the effects they find are small (Britten 1996; Coltman & Slate 2003; Chapman et al. 2009; Szulkin et al. 2010). Some authors have argued that the power of HFC studies may be limited either by the demographic history of the population or by the inability of the markers used to reflect that history (Väli et al. 2008; Ljungqvist et al. 2010; Grueber et al. 2011b). The latter problem may be solved by considering large sets of markers (DeWoody & DeWoody 2005; Miller et al. 2014), a feat that is becoming increasingly feasible with simultaneous marker discovery and genotyping (e.g. restriction site-associated DNA sequencing or genotype-by-sequencing; Davey et al. 2011; Elshire et al. 2011) using next-generation sequencing technology.
The issue of demographic history is potentially more complicated. One does not expect to detect general-effect HFCs unless there is variance in the level of inbreeding within a population (Slate et al. 2004; Szulkin et al. 2010), or another event (e.g. admixture) or the mating system (e.g. partial selfing) causes similar changes in the nature of heterozygosity in a genome (David et al. 2007; Szulkin et al. 2010). Recent work has highlighted identity disequilibrium (ID) as one measure that may capture these differences in heterozygosity (David 1998; Bierne et al. 2000; David et al. 2007). ID is the covariance in heterozygosity among markers within individuals, which should reflect identity by descent (IBD) of those markers (Bierne et al. 2000; David et al. 2007; Szulkin et al. 2010). The magnitude of the covariance is affected by demographic history and mating systems. In the absence of ID, the set of markers used in an HFC analysis is expected to reflect only its local genomic environment, thereby limiting a study to detecting only local-effect or direct-effect HFCs.
Two metrics have been developed to test for the presence of ID: heterozygosity–heterozygosity correlations (Balloux et al. 2004) and the g2 statistic (David et al. 2007). Heterozygosity–heterozygosity correlations gauge ID by dividing a given set of loci into two even sets and then assessing the correlation in heterozygosity between them. This process is repeated hundreds of times to yield an average correlation and test its significance. In contrast, the g2 statistic assesses the covariance of heterozygosity between markers standardized by their average heterozygosity, and its significance can be tested by permuting genotypes. Thus, g2 is a population parameter that summarizes the variance in inbreeding, rather than individual realized IBD (Szulkin et al. 2010; Ruiz-Lopez et al. 2012). Recent work has suggested the use of g2 rather than heterozygosity–heterozygosity correlations to assess ID as calculation of the latter involves nonindependent data sets and may be influenced by the properties of the exact marker set used. In contrast, calculation of g2 uses all markers simultaneously, is expected to only be affected by demographic history and is more central to HFC theory (Szulkin et al. 2010).
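The split-half procedure behind heterozygosity–heterozygosity correlations can be sketched as follows. This is a minimal illustration, not the implementation used in any cited study: the function names, the 0/1 input coding (1 = heterozygous at a locus), the default of 500 repetitions and the handling of an odd number of loci (the leftover locus is dropped) are all assumptions, and missing genotypes are not handled.

```python
import random


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector has no variance
    (an assumption made here to keep the sketch simple)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def het_het_correlation(het, n_reps=500, seed=0):
    """Average correlation in heterozygosity between random halves of the loci.

    het: one list per individual of 0/1 indicators (1 = heterozygous),
         with the same loci in the same order for every individual.
    """
    rng = random.Random(seed)
    loci = list(range(len(het[0])))
    half = len(loci) // 2  # with an odd number of loci, one locus is left out
    total = 0.0
    for _ in range(n_reps):
        rng.shuffle(loci)
        set_a, set_b = loci[:half], loci[half:2 * half]
        mlh_a = [sum(ind[j] for j in set_a) / half for ind in het]
        mlh_b = [sum(ind[j] for j in set_b) / half for ind in het]
        total += pearson(mlh_a, mlh_b)
    return total / n_reps
```

Because every repetition reuses the same individuals and loci, the repeated splits are not independent, which is the concern raised above about this metric.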
While a g2 estimate significantly different than zero may be a good indication of the ability to identify a general-effect HFC, Szulkin et al. (2010) noted that nonsignificant values of g2 do not preclude presence or detection of an HFC. They assert that the phenotypic effects of inbreeding are more readily detected than correlations among marker heterozygosity (i.e. ID). Indeed, the recent paper by Kardos et al. (2014) highlighted this same point. Here, the authors used simulations to assess the ability of g2 to detect HFCs due to inbreeding over a range of demographic scenarios and with variable numbers of markers. They found that g2 does give an indication of the strength of HFCs due to inbreeding (i.e. general-effects); however, often either the magnitude of a g2 estimate or HFC calculation will not reach the level of statistical significance, even if an association was present.
Empirical studies are now starting to assess ID in wild populations (e.g. Borrell et al. 2011; Olano-Marin et al. 2011a; Wetzel et al. 2012). However, no thorough review of the metric, or its relation to observed effect sizes of HFCs, has been conducted. In this work, we perform a meta-analysis to examine the effect of ID in empirical HFC studies. We first assess the magnitude of ID (as measured by g2) in previously published HFC studies. We then look to see whether the magnitude of g2 is a predictor of the observed effect sizes in these studies. Finally, we use recently derived equations (Miller et al. 2014) to examine how much power the studies had to detect general-effect HFCs, and the number of markers that would have been needed to generate a high expected correlation (r2 = 0.9) between marker genotypes and inbreeding.
For this meta-analysis, we only considered studies of outbreeding vertebrates that conducted individual-based HFC analyses. Studies that pooled individuals to create population averages were not considered. Studies also needed to report a summary statistic that we could convert into an effect size (see below). Our literature search began by considering all papers citing David et al. (2007) that reported a g2 statistic. We then expanded our criterion by searching the Dryad Digital Repository (http://datadryad.org, last accessed November 2013) for HFC studies with publicly available data sets from which we could calculate a g2 statistic. For this, we used the key words ‘heterozygosity correlation’, ‘fitness’ and ‘inbreeding’. In addition, we conducted a Google Scholar search (www.scholar.google.com; last accessed November 2013) for ‘Heterozygosity Fitness Correlations’, limiting dates of the papers returned to those published since 2009. This filter was applied to increase the chance that genotypes would be archived online. However, these measures resulted in only a small number of studies (n = 18) with either online marker data or g2 estimates. Thus, to increase our sample size, we directly solicited data from the authors of studies cited in Chapman et al. (2009), as well as those citing Chapman et al. (2009) but that did not have marker data available online. In these cases, authors were asked to provide either raw genotype data, tables of individual homozygosity at each locus, or to calculate g2 themselves.
For each study, we recorded the following covariates: number of individuals, number of loci, heterozygosity metric [i.e. multilocus heterozygosity, standardized multilocus heterozygosity (Coltman et al. 1999), homozygosity by locus (Aparicio et al. 2008), internal relatedness (Amos et al. 2001) or mean d2 (Coulson et al. 1998)], average observed heterozygosity and trait category. Internal relatedness is a measure of heterozygosity that takes into account the frequency of alleles in the population such that rare alleles have higher weight (Amos et al. 2001), while mean d2 measures the squared difference in allele sizes at each locus and then averages those values over all loci (Coulson et al. 1998). Trait categories were chosen to match Coltman & Slate (2003) and Chapman et al. (2009): life history (e.g. survival, breeding success), morphological (e.g. size, symmetry) and physiological (e.g. parasite burden, hormone levels).
A common feature of many HFC studies is for the authors to examine multiple traits within a single population, for example egg size, clutch size, hatching success and fledging success (Wetzel et al. 2012). It is also common for studies to examine more than one heterozygosity measure due to different rationales underlying their calculations (Coulson et al. 1998; Amos et al. 2001; Aparicio et al. 2008), although recently this practice has been discouraged (Chapman et al. 2009). If a study used multiple measures of heterozygosity, or reported multiple HFCs for different traits in the same population, we recorded these as independent data points. Whenever possible, we updated the associated covariates to reflect the specific subset of individuals used for each HFC.
If reported, g2 was recorded from the manuscript; otherwise, it was calculated from available marker data via RMES (David et al. 2007) using 1000 permutations to test the significance of the measure. g2 was recorded or calculated for each population and where possible, each subset of individuals identified by the authors. For example, if HFCs for male and female breeding success were reported independently, we would calculate g2 estimates for males and females separately.
Effect size calculations
We recorded the correlation coefficient (r) between heterozygosity and fitness measures. If r was not reported, we recorded other summary statistics and then transformed them to r following Coltman & Slate (2003). If t values were reported, we used
$$r = \sqrt{\frac{t^2}{t^2 + \mathrm{d.f.}}}$$
where d.f. is the degrees of freedom on which the test was based. If an F statistic was reported, we used
$$r = \sqrt{\frac{F}{F + \mathrm{d.f.}}}$$
If a χ2 value was reported, we used
$$r = \sqrt{\frac{\chi^2}{n}}$$
where n is the sample size. For R2 values, the transformation was via
where p is the number of parameters in the model. Finally, if only P-values were reported, we used
$$r = \sqrt{\frac{Z^2}{n}}$$
where Z2 is the standard normal deviate of the P-value. In cases where exact P-values were not given, but rather stated as >0.05 or ‘nonsignificant’, we set P equal to 0.5. Directions for effect sizes were assigned a posteriori depending on the observed correlation in the study; for studies using internal relatedness or homozygosity by locus, the sign of the correlation was reversed to match the other estimators.
Finally, r values were transformed into effect sizes for use in all subsequent analyses using the equation
$$Z_r = \frac{1}{2}\ln\left(\frac{1 + r}{1 - r}\right)$$
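As an illustration, the conversions from t, F, χ2 and P-values to r, and from r to an effect size, might be coded as below. This is a sketch under stated assumptions: it uses the standard meta-analytic forms of these conversions and Fisher's Z transformation, treats P-values as one-tailed (so that P = 0.5 maps to r = 0), and uses function names invented here. The R2 conversion is left out because it depends on the model parameter count p in a way not specified in this text. All functions return magnitudes only, since the sign is assigned afterwards.

```python
import math
from statistics import NormalDist


def r_from_t(t, df):
    """r from a t value and its degrees of freedom."""
    return math.sqrt(t * t / (t * t + df))


def r_from_f(f, df):
    """r from an F statistic (one numerator d.f.) and its error d.f."""
    return math.sqrt(f / (f + df))


def r_from_chi2(chi2, n):
    """r from a chi-square value (1 d.f.) and sample size n."""
    return math.sqrt(chi2 / n)


def r_from_p(p, n):
    """r from a P-value via its standard normal deviate.
    Assumes a one-tailed P, so P = 0.5 gives Z = 0 and r = 0."""
    z = NormalDist().inv_cdf(1.0 - p)
    return math.sqrt(z * z / n)


def fisher_z(r):
    """Fisher's Z transformation of a correlation coefficient."""
    return 0.5 * math.log((1.0 + r) / (1.0 - r))
```

For example, a reported t of 2.0 on 12 degrees of freedom gives r = 0.5, and a result reported only as ‘nonsignificant’ (P set to 0.5) contributes an effect size of zero.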
As noted above, it is common for HFC studies to examine multiple fitness measures, multiple measures of heterozygosity or combinations of both using the same sets of individuals and markers. Thus, inherent in many HFC studies is a level of pseudoreplication or nonindependence. Previous meta-analyses of HFCs have dealt with this problem by running several models that each treat the data in a different way and then comparing the results (Coltman & Slate 2003; Chapman et al. 2009). These included ignoring the issue of pseudoreplication and treating each point as independent, averaging metrics within studies, or running mixed-effect models including study, species and population as random effects. In our case, the issue of multiple fitness or heterozygosity measures within a study is coupled with the fact that often only one measure of g2 was available, even if different subsets of individuals were used to calculate several HFCs. Thus, to avoid pseudoreplication, we performed a univariate analysis considering a single effect size and g2 estimate for each trait type within a population. A univariate analysis is a conservative approach and avoids the additional pseudoreplication that would come from having one g2 estimate associated with multiple points in a multivariate mixed model. We chose to keep trait categories separate as one may expect HFCs to be more apparent in life history traits than morphological ones due to different selection pressures (directional vs. stabilizing) that act on each (Chapman et al. 2009; Szulkin et al. 2010).
Weighted average effect sizes were calculated using the formula:
$$\bar{z} = \frac{\sum_i x_i z_i}{\sum_i x_i}$$
where zi is the ith effect size and xi = n − 3 is its weight, with n the number of samples that went into calculating zi. For example, Luquet et al. (2013) looked at six traits (male body size, female body size, male body condition, female body condition, clutch mass and egg size) in four populations of the European tree frog (Hyla arborea). We considered four of the traits to be morphological and two to be life history related. To obtain average effect sizes, we grouped the individual effect sizes within each trait type and calculated a weighted average using the formula above. This was done for each population separately, resulting in a total of eight effect size estimates. In cases where authors presented analysis of the same traits using multiple measures of heterozygosity, we considered all measures of heterozygosity together [similar to MLHinc of Chapman et al. (2009)]. Grand mean effect sizes and their confidence intervals were back-transformed to r for the presentation of summary statistics. For populations where multiple g2 values were available, we calculated an average g2.
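The weighting and back-transformation described above can be sketched in a few lines. The function names are illustrative, and the back-transformation assumes the effect sizes are Fisher's Z values, for which the inverse is the hyperbolic tangent.

```python
import math


def weighted_mean_effect(z_values, sample_sizes):
    """Average of effect sizes z_i, each weighted by x_i = n_i - 3."""
    weights = [n - 3 for n in sample_sizes]
    return sum(w * z for w, z in zip(weights, z_values)) / sum(weights)


def back_transform(z):
    """Convert a Fisher's Z effect size back to a correlation coefficient r."""
    return math.tanh(z)
```

For instance, two effect sizes of 0.1 and 0.3 based on 13 and 33 individuals receive weights of 10 and 30, giving a weighted average of 0.25.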
Average effect size values were used in a linear model and assigned weights based on the variance of the values (as estimated by 1/(navg-3), where navg is the average number of individuals per population). All linear models were fit in R version 2.15.2 (R Development Core Team 2005). We included g2, trait type, average number of loci and average heterozygosity as covariates in the model. Model simplification proceeded using an information theoretic approach (Grueber et al. 2011a) as implemented in the package MuMIn version 1.9.5 (Bartoń 2009). We fit a maximal model containing all covariates and then assessed model differences with AICc (AIC values corrected for small sample sizes) values using the dredge function. In cases where ΔAICc scores did not differ by more than 2, we retained the simpler model.
We tested for evidence of publication bias using Egger's regression (Egger et al. 1997). Specifically, we regressed normalized study average effect size (i.e. average effect size divided by the standard error of the measurement) against average sample size. Publication bias is indicated by an intercept that is significantly different than zero (Sterne et al. 2005a,b).
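A minimal sketch of this regression is given below, assuming the effect sizes are Fisher's Z values whose standard error is 1/sqrt(n − 3), so that the normalized effect size is z·sqrt(n − 3). The function name and return values are illustrative. The intercept and its standard error come from ordinary least squares; an intercept far from zero relative to its standard error would point to bias.

```python
import math


def egger_intercept(z_values, sample_sizes):
    """Regress normalized effect size on sample size.

    Returns (intercept, standard error of the intercept, slope).
    Assumes SE(z_i) = 1 / sqrt(n_i - 3), so normalized effect = z_i * sqrt(n_i - 3).
    """
    x = [float(n) for n in sample_sizes]
    y = [z * math.sqrt(n - 3) for z, n in zip(z_values, sample_sizes)]
    k = len(x)
    mx, my = sum(x) / k, sum(y) / k
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    # Residual variance with k - 2 degrees of freedom
    resid_ss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    s2 = resid_ss / (k - 2)
    se_intercept = math.sqrt(s2 * (1.0 / k + mx * mx / sxx))
    return intercept, se_intercept, slope
```

An approximate 95% confidence interval for the intercept is the estimate plus or minus the appropriate t quantile (k − 2 degrees of freedom) times its standard error.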
Power of studies to detect inbreeding
We investigated the predicted correlation between observed heterozygosity and inbreeding with eqn 5 of Miller et al. (2014). Here, the correlation between inbreeding and heterozygosity is a function of the number of loci considered, their average heterozygosity (as reported in the manuscript or calculated from available marker data) and the magnitude of ID as measured by g2. In cases where the estimate of g2 was negative, we set the correlation to 0. Finally, we calculated the number of markers that would have been needed for these populations to have a large correlation (r2 = 0.9) between marker heterozygosity and inbreeding. To do this, we modified eqn 5 of Miller et al. (2014) solving for the number of loci (La)
where h is average observed heterozygosity of the markers.
Data acquisition and summary statistics
The literature search and survey of researchers resulted in data from 50 studies (49 papers and one PhD thesis). Collectively, these represent 45 species and 105 populations or subsets of individuals, for a total of 585 individual effect size estimates (Table 1). Study averaging within each trait type resulted in 129 effect size estimates which are summarized in Table 2. The average r value for each trait type was low (range: 0.025–0.064) but none of the 95% confidence intervals cross zero (Table 2). Egger's regression indicated that the intercept was not significantly different than zero (intercept = 0.255, 95% CI = −0.066–0.575; Fig. S1, Supporting Information).
Table 1. Taxa, studies, number of populations and effect sizes for each trait type included in the meta-analysis. Trait types are either life history (LH), morphological (M) or physiological (P)
Table 2. Number of estimates (k), the average effect sizes (r) and their confidence intervals for each trait category
We found a wide range of g2 estimates (average ± SD = 0.007 ± 0.022; range: −0.058–0.159) for the 129 effect sizes, only 26 of which were significantly different than zero. These 26 estimates had an average value of 0.025 (±0.031). To better understand what may be driving this wide range of g2 values, we examined the relationship of g2 to other covariates. However, we found that there was no relation to the average number of loci (Pearson's r = −0.006, t127 = −0.066, P = 0.948), average sample size (Pearson's r = −0.031, t127 = −0.349, P = 0.728) or year of publication (Pearson's r = −0.118, t127 = −1.342, P = 0.182).
Our model selection criterion retained four models within two AICc values of the top ranked model. Three contained g2 and one of the other covariates (either average heterozygosity, average number of markers or the trait category), while the second highest ranked (ΔAICc = 0.24 from the top model) contained only g2. This model containing only g2 was retained for further inspection. Here, g2 was positively correlated with average effect size (estimate ± SE = 2.66 ± 0.73; adjusted R2 = 0.09; Fig. 1 left). Although four outliers are visible in the graph, removing these points did not change the pattern observed (g2 estimate ± SE = 1.54 ± 0.67; adjusted R2 = 0.03), so they were left in for all further analyses.
Given that the majority of g2 values were not significantly different than zero, we conducted two additional analyses. In the first, we set all g2 values that did not differ from zero to zero. This resulted in the same four models being selected with ΔAICc < 2, but now the top ranked model was the one containing only g2 (ΔAICc = 1.36 from the next model). This new model still showed a positive relationship with average effect size, but the magnitude of the relationship and associated R2 value were increased (estimate ± SE = 5.85 ± 1.02; adjusted R2 = 0.20; Fig. 1 centre). In the second analysis, we considered only the g2 values that differed from zero. This returned only two models with ΔAICc < 2. The top model contained only g2, and the second contained g2 and average heterozygosity (ΔAICc = 1.41 from the top model). As with the two previous models, g2 had a positive relationship with average effect size, but the magnitude of the estimate and the R2 were increased yet again (estimate ± SE = 6.05 ± 1.63; adjusted R2 = 0.34; Fig. 1 right).
Power of studies to detect inbreeding
After zeroing negative correlations (n = 32), average expected correlation between marker heterozygosity and inbreeding was 0.13, but a wide range of values were observed (0–0.82, Fig. 2). Estimates of the number of loci needed to achieve an r2 = 0.9 between heterozygosity and inbreeding were then based only on those values with g2 estimates greater than zero (n = 95). Corresponding to the low average g2 values in these populations, a large number of loci (average ± SD = 5611 ± 8996) would be needed to have correlations of 0.9. This distribution of values is shown in Fig. 3.
As with previous meta-analyses of HFC studies (Britten 1996; Coltman & Slate 2003; Chapman et al. 2009), we found that average effect sizes were very low. For life history and morphological traits, we observed average effect sizes lower than the study average values reported by both Coltman & Slate (2003) and Chapman et al. (2009). However, unlike those two previous studies, the 95% CI for physiological traits did not overlap zero (Table 2). We found no evidence of funnel plot asymmetry (intercept was not different from zero) using Egger's regression, suggesting that, in our sample, studies with small sample sizes are not overestimating effect sizes, nor is there a publication bias against studies with negative results. This contrasts with the evidence for publication bias found by Coltman & Slate (2003) and Chapman et al. (2009).
Our assessment of g2 from empirical HFC studies found a wide range of values, the majority of which were not significantly different than zero (Fig. 1). We found no evidence that the magnitude of g2 was associated with the average number of markers or average sample size. In addition, there was no trend in observed g2 values over time, despite the fact that both the number of individuals and the number of loci considered in each HFC study were observed to have increased over time.
The univariate analysis of study average effect sizes highlighted g2 as the only variable associated with average effect size. As predicted (Szulkin et al. 2010), the relationship between g2 and effect size was positive, where studies that had large estimates of g2 were able to explain more of the variance between heterozygosity and fitness. This pattern holds even with inclusion of both significant and nonsignificant g2 estimates (Fig. 1), but the relationship explains a small amount of variation. When we reduced the data set to only estimates of g2 that were significantly different than zero, the association increased greatly (adjusted R2 = 0.34 vs. 0.09). Thus, even when nonsignificant, g2 is still an indicator of the presence of general-effect HFCs, but there is a lot of noise around the estimate. These findings are in line with previous simulation studies (Kardos et al. 2014) as well as theoretical predictions of the influence of g2 on the correlation between heterozygosity and fitness (Miller et al. 2014).
Part of the reason previous studies have reported small effect sizes could be that they were simply underpowered. We tested this hypothesis by looking at the expected correlation between observed heterozygosity and inbreeding (Miller et al. 2014) for each population. We found that many of the previous HFC studies had low expected correlations between heterozygosity and inbreeding (Fig. 2). Coupled with the low g2 estimates, lack of power may have been due to the small number of loci used. On average, 40 loci were used in a study (although this number drops to 19 if we exclude the work of Forstmeier et al. (2012) who genotyped 1359 SNPs in a population of zebra finches, Taeniopygia guttata). A much larger number of loci (approximately 5611) would have been needed to confidently explore general-effect HFCs in these populations (Fig. 3).
Interestingly, the observation that all studies seem to be underpowered contradicts the thinking that the first study to publish a significant result will set an effect size threshold against which future studies have to match in order to be published, a so-called winner's curse (Zollner & Pritchard 2007; Nakaoka & Inoue 2009; Forstmeier & Schielzeth 2011). This curse leads to inflation of effect size estimates and publication bias against any study that reports a null result or lower correlation. In contrast, our findings hold that average effect sizes were small, all the studies were underpowered, and there was no evidence for a publication bias. A supplementary analysis also showed that there was no trend in average effect size over time (estimate ± SE = −0.01 ± 0.00, P = 0.09), which stands in contrast to other studies (Jennions & Moller 2002). Taken together, this indicates that our pool of HFC studies has managed to avoid the ‘winner's curse’ and allows for robust inferences to be made in this study.
The observation that HFC studies will need a substantially larger number of markers has been made by others (Balloux et al. 2004; Väli et al. 2008; Ljungqvist et al. 2010; Forstmeier et al. 2012; Miller et al. 2014; Kardos et al. 2014), and it is now becoming possible to generate such large marker sets by capitalizing on genomic technology (Ekblom & Galindo 2011; Angeloni et al. 2012). One point to consider is that all but one of the studies we included (Forstmeier et al. 2012) were based on microsatellite data. Moving forward, we imagine that most new large-scale data sets will be of SNP loci rather than microsatellites, as SNP genotyping can be automated (Shen et al. 2005), while scaling up microsatellite genotyping is not currently possible. Thus, the estimates we present of the number of loci required for a robust HFC study likely represent a lower bound as, on average, microsatellites have higher heterozygosity than SNPs. Higher heterozygosity translates to a higher expected correlation with genome-wide heterozygosity if the same number of loci is considered (Miller et al. 2014). We should also note that setting the desired correlation to inbreeding at r2 = 0.9 requires more markers than smaller values would. It will be up to individual researchers to determine their desired level of correlation when performing similar calculations.
More broadly, use of large sets of loci will be a great boon to researchers investigating HFCs. Not only will they allow for confident exploration of general-effect HFCs, but also for detailed assessments of local-effects or direct-effects, especially if the loci are anchored in the genome, via linkage mapping or alignment to a reference sequence, so that genes of interest can be identified (Slate et al. 2009; Olano-Marin et al. 2011b; Voegeli et al. 2013).
In this meta-analysis, we assessed the magnitude of identity disequilibrium (as measured by the g2 statistic) in 109 populations or analysis subsets from 50 previously published HFC studies. Across the majority of studies, g2 values were not significantly different than zero. However, we found that the magnitude of g2 was associated with the average effect sizes observed in a population, even when nonsignificant g2 estimates were considered. These low values of g2 also translated into low expected correlations between heterozygosity and inbreeding and suggested that many more markers would have been needed for robust HFC calculations.
However, we would argue that before researchers concern themselves with getting a large number of markers, they should consider the demographic history of the population, and whether it will be possible to detect general-effect HFCs. Such an assessment can be made with a small preliminary data set to gauge ID at the outset. Although point estimates may not be precise (especially if the estimate is not different from 0), they can give a sense of the effect size that could be observed and the number of markers that will be needed for robust HFC calculations. We imagine that HFC analysis will remain a key toolset used by both researchers and wildlife managers, and with genomic techniques, new avenues of research into local- or direct-effects may be on the horizon.
We would like to thank all of the researchers who took time to reach back in their archives and provide data sets for this study. We would also like to thank Jessica Haines for statistical advice as well as Corey Davis, Rene Malenfant, three anonymous reviewers and Louis Bernatchez for helpful comments on the manuscript.
JMM's graduate research was funded by an Alberta Innovates graduate scholarship, a Natural Sciences and Engineering Research Council of Canada (NSERC) Vanier scholarship, the University of Alberta and the Killam Foundation. DWC is supported by an NSERC Discovery Grant.
J.M.M. conceived the study and drafted the original manuscript. D.W.C. provided analytical guidance and input to the manuscript throughout its preparation.