Comparison of univariate and multivariate methods for meta-analysis
Multivariate methods of meta-analysis in a mixed-effects model framework represent an improvement in our ability to draw valid conclusions when pooling the results of multiple studies in ecology and evolution, and indeed in all fields of quantitative science, as they help to account for a major problem with meta-analysis, that of nonindependence of data points. Nevertheless, we chose to conduct two types of univariate meta-analysis in addition to a more comprehensive multivariate meta-analysis, both to facilitate comparisons with earlier meta-analyses of such studies (Britten 1996; Reed & Frankham 2001; Coltman & Slate 2003) and to allow direct comparison between the methods.
The ‘all effects independent’ univariate analysis revealed weak but significantly positive mean effect sizes for all of the genetic metrics with the exception of St d2; stronger effect sizes were found for life-history than for morphological and physiological traits. These results differ in some ways from findings of previous univariate meta-analyses: Britten (1996) found an overall effect size of r = 0.133 for HFCs, while Reed & Frankham (2001) found an overall effect size of r = 0.217, with LH traits exhibiting a weaker and negative mean effect (r = –0.110) when compared with M traits (r = 0.311). More recently, Coltman & Slate (2003) found that the overall effect size for life-history traits was r = 0.086 and r = 0.048 when MLH and mean d2, respectively, were used as the genetic metric, while effect sizes were lower for morphological and physiological traits (r = 0.004–0.008). Our ‘study unit average’ univariate analysis revealed an increase in the estimated weighted mean effect size for both LH and M traits, but a smaller difference in effect size between these two trait types (LH vs. M: Δr = 0.056 when all effects were treated as independent data, Δr = 0.039 for study unit average analysis). This was again in broad agreement with previous results: C&S found an increase in mean effect size for both LH (r = 0.112) and M traits (r = 0.052), but a decrease in the magnitude of difference between LH and M traits when using the study unit average approach (Δr = 0.080 for independent analysis, Δr = 0.060 for study unit average analysis). In plants, the weighted mean correlation between heterozygosity and fitness was also significantly positive (r = 0.306), and was significantly influenced by mating system, whereby self-incompatible plants showed significant positive HFCs whereas self-compatible plants did not, but not by plant rarity or longevity (Leimu et al. 2006). It is interesting to note the much stronger mean effect size found for plants than for animals.
This is perhaps due to the fundamental differences in mating system and demography between plants and animals. For example plant populations can exhibit strong demographic population structure (such as age structure) not often seen in animal populations. A lack of published information meant that Leimu et al. (2006) were not able to specifically address the impact of demography on HFCs in plants. However, it should also be noted that study sample sizes in Leimu et al. (2006) are very small (n = 2–14), and the authors recommend their conclusions be considered preliminary. Methodological differences between our study and that of Leimu et al. (2006) may also help to account for the large difference in mean effect sizes seen between plants and animals. For example, along with heterozygosity, Leimu et al. (2006) also considered percentage of polymorphic loci and the number of alleles as measures of genetic diversity.
The multivariate meta-analysis we carried out here also revealed weak but significantly positive mean effect sizes for the genetic metrics MLH, SH, IR and mean d2 and a nonsignificant mean effect size for St d2. This lack of significance for St d2 may be due to a paucity of studies that have reported this metric (14 effect sizes from 9 study populations); however, it seems likely that standardizing this measure actually results in a loss of genetic signal, and standardized d2 is increasingly considered to be less informative than other measures (Hoffman et al. 2006).
Only one study included in the meta-analysis identified outbreeding depression in their population (Marshall & Spalton 2000), and, given that outbreeding depression is thought to be rare in animal populations (Frankham 1995a; Pusey & Wolf 1996), we felt justified in assuming that overall, any signature of outbreeding depression in our meta-analysis would be negligible. Nevertheless, it is possible that some populations included here actually had high rates of undetected outbreeding depression. Such populations would be expected to exhibit strong negative HFCs, and as such may have lowered our global estimates of weighted mean effect size. Negative effects were less common than positive effects among the studies in our meta-analysis (34% of effects), and strong negative HFCs were especially rare (3.8% of effects were Zr ≤ −0.25), but were detected using both MLHinc and mean d2 metrics (Fig. 2), in contrast to the analysis of C&S, where strong negative effects were only detected with mean d2.
While the study unit average analyses suggested that effect sizes are strongest for life-history traits, this pattern was not mirrored in the multivariate meta-analysis; here mean effect sizes were much more similar between the three trait types. Thus, the large difference in mean effect size for life history vs. morphological traits found in C&S was only partially supported here, and was dependent upon the genetic metric employed and the type of meta-analysis conducted. The change in effect size dependent on the method used suggests that the low value found for M traits in the initial independent effects univariate analysis was due, at least in part, to pseudoreplication of effect sizes. Sequentially dropping each of the higher order nested random factors (class, family, and species) in the multivariate analysis revealed increasingly large differences between life-history traits and morphological and physiological traits (data not shown). This again suggests that earlier meta-analyses have not adequately controlled for pseudoreplication within and between studies, and that the ‘study-unit average’ approach used here, and taken by some authors, does not fully account for replicated results.
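The contrast between the two univariate approaches can be illustrated with a minimal sketch. It assumes that correlations are pooled on the Fisher Zr scale with the conventional inverse-variance weight of n − 3; the weighting, the function names and the data are hypothetical and are not taken from the analysis reported here. The sketch does not attempt the multivariate mixed-effects model, which additionally requires nested random factors.

```python
import math
from collections import defaultdict

def weighted_mean_r(effects):
    """Pool (r, n) pairs on the Fisher Zr scale, weighting each by n - 3
    (an assumed, conventional weight), then back-transform to r."""
    num = sum((n - 3) * math.atanh(r) for r, n in effects)
    den = sum(n - 3 for _, n in effects)
    return math.tanh(num / den)

def study_unit_average(records):
    """Collapse (unit, r, n) records to one (r, n) pair per study unit by
    averaging Zr and sample size within each unit."""
    by_unit = defaultdict(list)
    for unit, r, n in records:
        by_unit[unit].append((r, n))
    collapsed = []
    for items in by_unit.values():
        mean_z = sum(math.atanh(r) for r, _ in items) / len(items)
        mean_n = sum(n for _, n in items) / len(items)
        collapsed.append((math.tanh(mean_z), mean_n))
    return collapsed

# Hypothetical data: one population contributes four near-zero effects,
# two other populations contribute one effect each.
records = [
    ("pop_A", 0.02, 200), ("pop_A", 0.01, 200),
    ("pop_A", 0.03, 200), ("pop_A", 0.00, 200),
    ("pop_B", 0.15, 60),
    ("pop_C", 0.10, 80),
]

independent = weighted_mean_r([(r, n) for _, r, n in records])
unit_avg = weighted_mean_r(study_unit_average(records))
print(f"all effects independent: r = {independent:.3f}")
print(f"study unit average:      r = {unit_avg:.3f}")
```

In this invented example the repeatedly sampled population dominates the 'all effects independent' estimate and pulls it towards zero, whereas averaging within study units reduces its influence. Neither approach models the remaining nonindependence between related study units.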
The model averaging analysis showed that the main factors that appear to be influencing the magnitude of reported effect sizes in this meta-analysis are properties intrinsic to the study design; namely (i) whether St d2 or any of the other four genetic metrics was used to determine genetic variability; and (ii) whether the study had been published or not. Intrinsic properties of the individuals and populations studied, such as the type of trait measured (LH/M/P) and the ecological setting of the population (wild/captive/domestic) had much lower influence on mean effect sizes, as models incorporating these factors were less well supported (Table 6) and these variables had lower relative importance. In general, this suggests a rather poor fit between the results of HFC analyses and expectation based on population genetic theory. Below we discuss how these results relate to the objectives of the paper listed in the introduction.
Evidence for publication bias
We found three lines of evidence for a bias towards publishing significant effects. First, unpublished studies had smaller effect sizes than published studies. Indeed, in the univariate analysis, unpublished effect sizes for all genetic metrics were not statistically different from zero. Second, the funnel plots show clear evidence for missing studies with small effects and low sample size. Third, the trim and fill analysis suggested there were 48 missing effects from the pool of 481 published effects used in the analysis, suggesting that around 10% of all HFC effect sizes recorded by researchers go unreported in the literature. The analysis suggested that if these ‘missing’ effect sizes had been published, weighted mean effect sizes would be weaker, albeit still significantly positive. The trim and fill analysis of papers published since C&S suggested that only around 4% of detected HFC effect sizes now go unpublished, suggesting a reduction in bias since the publication of C&S.
It has been suggested that meta-analyses are biased towards finding positive effects, especially if care is not taken to identify and include ‘missing’ nonsignificant results (Kotiaho & Tomkins 2002; Jennions et al. 2004; Tomkins & Kotiaho 2004). However, our meta-analysis suggests that publication bias in this field does not necessarily result in spurious conclusions being reached with regard to the existence of HFCs, as the trim and fill analysis suggested that including ‘missing’ studies would still reveal weakly significant positive effects. One reason for this may be that HFC studies generally report many correlations, and often many of these are nonsignificant; indeed, of the 481 effect sizes we collected from published studies, only 115 (24%) were significant at the P = 0.05 level. This indicates that there is no dearth of nonsignificant results in this field. However, many of these nonsignificant results were published in papers also reporting significant results. We have no way of knowing how many studies that fail to detect any significant results at all remain unpublished, although the trim and fill analysis suggests that this may not be a serious problem (although see Koricheva 2003). The publication bias seems to be strongest for negative results based on small sample sizes (Fig. 2). We also found evidence to suggest that publication bias in this field may have lessened since the publication of C&S, where this problem was first identified as a potential issue for microsatellite HFC studies.
Temporal trends in HFC studies
If it is reassuring to find a recent reduction in publication bias in this field (suggesting heeding of advice in C&S), other temporal trends inspire less confidence. Coltman & Slate (2003) suggested that sample sizes were often too small in this field, but there is little evidence that the number of individuals sampled, markers assayed, or total genotypes scored have been increasing since the publication of C&S (Fig. 3). As can be seen by visual inspection of funnel plots (Fig. 2), studies with small sample sizes exhibit large fluctuations around the estimated ‘true’ effect size when all studies are pooled, which may potentially have resulted in an overemphasis of the importance of this field of research. Certainly in small populations, there may be a limit to the number of individuals that can be realistically sampled, but we found no relationship between sample size and population size (Fig. S3). Furthermore, even when few individuals are available for sampling, researchers should aim to maximize the number of markers assayed, and ultimately genotypes scored, in order to have confidence in their measure of genetic diversity. Given recent developments aimed at reducing laboratory costs (e.g. Schuelke 2000; Wang et al. 2003; Symonds & Lloyd 2004; Guicking et al. 2008), and theoretical and empirical results showing that accurate estimates of genome-wide heterozygosity require large numbers of genotypes (Balloux et al. 2004; DeWoody & DeWoody 2005; Väli et al. 2008; Alho et al. 2009), it is somewhat surprising that this does not (yet) seem to have resulted in a concurrent increase in the amount of molecular data gathered for HFC studies.
Comparison of different genetic metrics
There has been a great deal of debate in the literature as to which genetic metric most accurately acts as a surrogate of the true inbreeding coefficient of an individual (see, for example, Aparicio et al. 2006). Here we show that the metrics MLH, SH, and IR are highly correlated and nonindependent; we would encourage researchers to report only one of these genetic metrics when publishing HFC studies in the future, since reporting the correlation between a fitness measure and multiple highly correlated genetic measures is pseudoreplication. It is important that this choice is made before data analysis begins, rather than driven by posthoc choices based on statistical significance, as this will result in another layer of bias in this field. In the multivariate meta-analysis, we found similar mean effect sizes with the three MLHinc genetic metrics, namely MLH, IR and SH (Table 4); mean effect size for mean d2 was smaller, and that for St d2 was not significantly different from zero. The rationale for mean d2 as a metric has been called into question (Hedrick et al. 2001; Tsitrone et al. 2001; Goudet & Keller 2002; Slate & Pemberton 2002; although see Neff 2004a; Kretzmann et al. 2006); and both d2 measures correlate only weakly with other, more direct, measures of heterozygosity. Until microsatellite mutational processes are more accurately elucidated, it seems likely that the relevance of these measures will continue to be debated (Slate et al. 2000), and as such we advocate the use of the simplest metric, MLH, in future HFC studies. Use of more than one metric is likely only justified when certain population demographic histories exist, where the use of MLH and mean d2 in tandem might actually provide insight into evolutionary processes such as stabilizing selection (Neff 2004b), or in populations where all individuals are highly heterozygous, and thus more traditional measures of heterozygosity fail to differentiate between individuals (Hedrick et al. 2001; Tsitrone et al. 2001; Goudet & Keller 2002; Slate & Pemberton 2002).
Comparison of different trait types: what constitutes an HFC study?
Studies reporting correlations between measures of individual variation and genetic diversity are collectively known as HFC studies. To some extent, this is a misnomer: in the published literature to date, the majority of traits for which relationships with genetic diversity have been reported (e.g. morphological traits) are likely to have little or no linear relationship with fitness. For instance, many morphological and physiological traits are more plausibly under stabilizing selection around an optimum, and this may also be true for other traits. For example, it is often assumed that life-history traits such as timing or age of first breeding and clutch or litter size are under strong directional selection, but these traits evolve in concert with selection on other life-history traits, with which they may exhibit genetic covariance (e.g. between clutch size and offspring size); only fitness itself can always be assumed to be under positive directional selection. In the absence of directional selection on a character, there is no clear reason to expect a relationship between that trait and heterozygosity, however extensively this is measured, and however high the variance in inbreeding within the population. In this light, the expectation about the strength of effect size for HFCs with respect to different classes of trait is perhaps unrealistic. Interestingly, we found effect sizes to be of a similar magnitude when classed as fitness or nonfitness traits (see Appendix S2, Supporting information). While it is possible we were too generous with our classification of fitness traits and too conservative with our classification of nonfitness traits, given that we also found reasonably similar effect sizes for life-history and morphological traits, our results suggest that such broad classification of traits does little to enhance our understanding of the underlying causes of HFCs in animal populations.
Thus, while we do not advocate renaming HFC studies, as use of this term is now widespread, we do advocate consideration of the likely form of selection on characters, and it might be illuminating to explore the relationship between the HFC effect size for traits and the form of selection on those traits for which empirical estimates of selection intensity are available.
Evolutionary theory would suggest that we should only expect correlations of genetic diversity with fitness-related traits because dominance variance is expected to be high for traits with a direct effect on fitness, and such traits have a more complex genetic architecture (Crnokrak & Roff 1995; DeRose & Roff 1999; Merilä & Sheldon 1999). It is thus surprising to find that mean effect sizes for fitness and nonfitness traits were of a similar (positive) magnitude. This might be due to publication bias if papers reporting correlations with nonfitness measures are more likely to be published if effects are large, whereas papers assessing the relationship with fitness are equally likely to be published regardless of effect size. Additionally, measurement error may be greater for life history than morphological and physiological traits; such an explanation has been invoked to explain why morphological traits appeared to exhibit stronger directional selection than life-history traits (Kingsolver et al. 2001).
Does demographic history influence the strength of HFCs?
We found no evidence that populations likely to have higher inbreeding variance exhibited stronger HFCs than populations likely to have low inbreeding variance (highly inbred or outbred populations, Fig. 5, Table S2). While our measures of demographic structure were quite crude, it is perhaps surprising that they did not reveal a coarse relationship in the expected direction. This may suggest that a majority of studies are actually detecting local, rather than general, effects (Balloux et al. 2004), or alternatively that publication bias strongly clouds any pattern in the data. A study by Hansson et al. (2004) employing within-brood comparisons found that even when the inbreeding coefficient is held constant, more heterozygous individuals were more likely to recruit to the local breeding population, which is strong evidence for local effects. Testing for local effects by regressing fitness on each individual marker is becoming standard practice; such tests are valuable in allowing us to understand the mechanisms underlying HFCs, and may also allow future identification of functionally important loci (e.g. Acevedo-Whitehouse et al. 2006; Luikart et al. 2008). However, caution must be exercised here: such multiple tests for significance will result in spurious significant results unless authors are careful to adjust the critical α level and thus guard against inflated type I errors (Simes 1986; Aiken & West 1991). It should also be remembered that single-locus correlations are not independent because heterozygosity is correlated across loci (P. David, personal communication), and that, even under the general effect hypothesis, we still expect more than 5% of loci to show single-locus HFCs. The standard approach should be to examine the distribution of effect sizes and identify outliers as being those effect sizes that may be statistically, and biologically, significant.
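The need to adjust the critical α level when testing many loci can be sketched as follows. The example contrasts unadjusted testing with a Bonferroni threshold (α/m per test) and with the global test of Simes (1986), which rejects the overall null hypothesis if any ordered P-value p(i) is at most iα/m. The P-values and function names are hypothetical. Simes' procedure was derived for independent tests, so, given the nonindependence of single-locus correlations noted above, this should be read as an illustration rather than a recommended analysis.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Per-test decisions using the Bonferroni threshold alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def simes_global_test(p_values, alpha=0.05):
    """Simes (1986) global test: reject the overall null hypothesis if any
    ordered P-value p(i) is <= i * alpha / m."""
    m = len(p_values)
    ordered = sorted(p_values)
    return any(p <= i * alpha / m for i, p in enumerate(ordered, start=1))

# Hypothetical single-locus P-values for a 10-marker panel.
locus_p = [0.004, 0.03, 0.04, 0.20, 0.35, 0.41, 0.58, 0.66, 0.80, 0.93]

naive = [p <= 0.05 for p in locus_p]          # no adjustment
bonferroni = bonferroni_reject(locus_p)       # threshold 0.05 / 10 = 0.005
any_effect = simes_global_test(locus_p)       # global null across all loci

print("unadjusted 'significant' loci:", sum(naive))
print("Bonferroni-significant loci:  ", sum(bonferroni))
print("Simes global test rejects:    ", any_effect)
```

With these invented values, three loci pass an unadjusted α = 0.05 but only one survives the Bonferroni threshold. The Simes test is never less powerful than Bonferroni as a global test: four tests each with P = 0.02 are rejected globally by Simes but not by Bonferroni.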
A fruitful direction for future studies of HFC would be to specifically address the impact of demographic factors by sampling from multiple populations such as island and mainland populations, populations with varying levels of habitat disturbance, or populations from a continuum of bottlenecking or founder events. While this approach will not be possible in small, endangered populations with limited range distributions, studies in more widespread species may help provide insight into the demographic processes important in endangered populations and thus help to inform conservation decisions (Reed & Frankham 2003; Grueber et al. 2008).
The future of HFC studies — where to from here?
The results of this meta-analysis indicate that while heterozygosity-fitness correlations may well be a general phenomenon in many wild vertebrate populations, these effects are very weak, equivalent in strength to correlations that explain < 1% of the variance in traits. As discussed above, the various measures of heterozygosity now in common use are not statistically independent, and should not be used in concert, as this will result in pseudoreplication. We would also encourage researchers to base future studies of HFCs in wild populations on the measurement of large numbers of individuals with larger marker panels. Furthermore, we would argue that the goal of such studies should ultimately be to infer evolutionary processes in populations (e.g. Slate et al. 2000), and as such an increase in the number of studies reporting HFCs in populations with known individual inbreeding coefficients would be beneficial (e.g. Coulson et al. 1998; Slate et al. 2004; Bensch et al. 2006; Olafsdottir & Kristjansson 2008), as would a more explicit investigation of the role of other demographic processes such as bottlenecks, admixture and the role of genetic purging. Another avenue of research that may well prove fruitful is to investigate the role of environmental stress on influencing the magnitude and direction of HFCs detected with microsatellites. Stress, such as periods of low food availability, high predation or increased environmental disturbance, is a key factor in reducing the fitness of populations, and individuals will vary in their response to stress (Hoffmann & Hercus 2000). This can result in an increase in genetic variance at the population level, for example due to the expression of genetic variance that was neutral under normal environmental conditions (Badyaev 2005).
Individuals with increased heterozygosity may well possess the necessary diversity of alleles required to adequately cope with environmental stochasticity; this has been termed episodic heterozygote advantage (Samollow & Soulé 1983). This avenue of research has received limited attention to date; however, the magnitude of HFC effects has been shown to correlate positively with habitat fragmentation in Taita thrush (Lens et al. 2000), salinity tolerance in guppy at the population, but not individual, level (Shikano & Taniguchi 2002), and food limitation in common frogs (Lesbarreres et al. 2005). Studies using allozyme variation have revealed similar patterns (see, for example, Scott & Koehn 1990; Audo & Diehl 1995; Myrand et al. 2002).
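The statement earlier in this section that HFCs are equivalent to correlations explaining < 1% of the variance in traits follows from the fact that a correlation r accounts for a proportion r² of the variance, so that any |r| < 0.1 corresponds to less than 1%. The short sketch below applies this to the life-history effect sizes of Coltman & Slate (2003) quoted above; the function name is hypothetical.

```python
def variance_explained(r):
    """Proportion of trait variance accounted for by a correlation r."""
    return r * r

examples = [
    ("threshold for 1% of variance", 0.10),
    ("C&S life-history traits, MLH", 0.086),
    ("C&S life-history traits, mean d2", 0.048),
]

for label, r in examples:
    pct = 100 * variance_explained(r)
    print(f"{label}: r = {r:.3f}, variance explained = {pct:.2f}%")
```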
We would advocate that the number of genotypes assayed be maximized, and would question the merit of future studies reporting HFCs detected with small numbers of microsatellite loci, given the lack of evidence that small marker sets have any power to infer genome-wide heterozygosity (Balloux et al. 2004; Slate et al. 2004; DeWoody & DeWoody 2005; Hansson & Westerberg 2008; Väli et al. 2008). One of the most hotly debated issues in HFC research at present is the relative contribution of local and genome-wide effects. This can only be resolved by studies assessing HFCs using large sets of markers. Furthermore, we would encourage authors to test the covariance in heterozygosity across markers, using the methods suggested by either Balloux et al. (2004) or Slate et al. (2004) in order to assess how well their marker set is likely to infer total genomic heterozygosity. A sobering conclusion is that, despite the very large amount of work in this area, the only factors that we have been able to find that explain variation in the strength of HFCs are methodological. Hence, our understanding of the biological reasons for variation in their strength remains poorly developed.
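One way to examine covariance in heterozygosity across markers is a split-half correlation: the marker panel is repeatedly divided at random into two halves, MLH is calculated for each half, and the two values are correlated across individuals. The sketch below is written in the spirit of the approaches cited above, but its details (number of splits, simulated genotypes, function names, and the parameter values in the simulation) are assumptions made for illustration and are not taken from those papers.

```python
import random

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either variable has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # correlation undefined; treated as zero in this sketch
    return sxy / (sxx * syy) ** 0.5

def mean_split_half_correlation(het, n_splits=100, rng=None):
    """het[i][j] is 1 if individual i is heterozygous at locus j, else 0.
    Returns the mean correlation in MLH between random halves of the loci."""
    rng = rng or random.Random(0)
    loci = list(range(len(het[0])))
    half = len(loci) // 2
    correlations = []
    for _ in range(n_splits):
        rng.shuffle(loci)
        a, b = loci[:half], loci[half:]
        mlh_a = [sum(row[j] for j in a) / len(a) for row in het]
        mlh_b = [sum(row[j] for j in b) / len(b) for row in het]
        correlations.append(pearson(mlh_a, mlh_b))
    return sum(correlations) / len(correlations)

def simulate(n_ind, n_loci, max_f, rng):
    """Hypothetical genotypes: each individual has inbreeding coefficient f
    drawn uniformly from [0, max_f]; per-locus heterozygosity is 0.7 * (1 - f)."""
    het = []
    for _ in range(n_ind):
        f = rng.uniform(0.0, max_f)
        p_het = 0.7 * (1.0 - f)
        het.append([1 if rng.random() < p_het else 0 for _ in range(n_loci)])
    return het

rng = random.Random(42)
variable_f = simulate(200, 40, 0.8, rng)   # large variance in inbreeding
constant_f = simulate(200, 40, 0.0, rng)   # no variance in inbreeding

hhc_variable = mean_split_half_correlation(variable_f, rng=random.Random(1))
hhc_constant = mean_split_half_correlation(constant_f, rng=random.Random(1))
print(f"variance in f:    mean split-half r = {hhc_variable:.2f}")
print(f"no variance in f: mean split-half r = {hhc_constant:.2f}")
```

In this simulation a clearly positive correlation arises only when individuals differ in inbreeding. A value near zero in real data would suggest that the marker set carries little information about genome-wide heterozygosity.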