NEWS AND VIEWS
Do microsatellites reflect genome-wide genetic diversity in natural populations? A comment on Väli et al. (2008)
Marcus Ljungqvist, Fax: 46462224719; E-mail: firstname.lastname@example.org
A recent study by Väli et al. (2008) highlights that microsatellites will often provide a poor prediction of the genome-wide nucleotide diversity of wild populations, but does not fully explain why. To clarify and stress the importance of identity disequilibrium and marker variability for correlations between multilocus heterozygosity and genome-wide genetic variability, we performed a simple simulation with different types of markers, corresponding to microsatellites and SNPs, in populations with different inbreeding history. The importance of identity disequilibrium was apparent for both markers and there was a clear impact of marker variability.
A main aim in conservation genetics is to unravel the link between genetic diversity and population viability (Frankham et al. 2002). To reach this goal we need to understand the causes and consequences of genetic variation in natural populations, which requires transparent methods to quantify genetic diversity within and between populations. On a genome-wide scale, genetic variation is distributed over thousands of genes and non-coding genomic regions. Ideally, evaluations of the genetic status of populations would include data on every aspect of genetic variation, including (i) the quantitative genetic variation of adaptive traits (Reed & Frankham 2001, 2003); (ii) the molecular genetic variation at critical functional loci (Hughes 1991; Miller & Lambert 2004); and (iii) the genetic variation over genome-wide distributed coding and non-coding loci (Hansson & Westerberg 2002; Slate et al. 2004). For practical reasons, we are still unable to quantify this diversity of genetic variation in studies of natural populations. Instead, we need to rely on proxies of the overall genetic variation, often inferred from a limited set of molecular markers. Microsatellites have been the marker of choice in many conservation genetic studies over the last decade (Frankham 1995; Beaumont & Bruford 1999; Coltman & Slate 2003).
A recent study by Väli et al. (2008) highlights that microsatellites will often provide a poor prediction of the nucleotide diversity elsewhere in the genome of wild populations, which can have important implications for conservation genetic studies. Väli et al. (2008) investigated to what extent a set of microsatellites (10–27 loci) explained the nucleotide diversity at 10 intronic loci (in total, approximately 5.5 kb) within and between eight mammalian populations of four species: grey wolves (Canis lupus), coyotes (Canis latrans), Eurasian lynx (Lynx lynx) and wolverines (Gulo gulo). In brief, they found a strong positive correlation (r2 = 0.70) between microsatellite heterozygosity and nucleotide diversity at the population level, but very weak correlations at the individual level within populations (r2 ≤ 0.09). Also, they noted that estimates of nucleotide diversity varied 30-fold (7.1 × 10−5–2.1 × 10−3) between populations, whereas microsatellite multilocus heterozygosity showed a 1.4-fold difference (0.54–0.78). They concluded that variability at microsatellite markers does not necessarily reflect the underlying genomic diversity in the wild. Similar conclusions have been reached previously (e.g. Hedrick 2001; Pemberton 2004; Slate et al. 2004) and we share the view that multilocus heterozygosity could be a poor predictor of genetic diversity in many population scenarios (Hansson & Westerberg 2002).
However, the reasons why microsatellite data are often poor predictors of genetic variation in natural populations are not fully explained in Väli et al. (2008). Several important explanations are thoroughly discussed, e.g. the fundamental difference in the underlying mutation processes of repetitive and non-repetitive DNA (Hedrick 2001; Ellegren 2004) and the ascertainment bias introduced by selecting the most polymorphic microsatellite loci (Pardi et al. 2005; Brandström & Ellegren 2008), while other explanations are not examined. Most importantly, what is not clearly pointed out is that an initial requirement for any correlation between multilocus heterozygosity and genome-wide genetic variability is the existence of variation in the degree of genome-wide diversity within the population (or between populations if more than one population is sampled). In other words, a requirement for genome-wide heterozygosity–diversity correlations is that the population shows ‘identity disequilibrium’, i.e. non-random associations of diploid genotypes between loci (Weir & Cockerham 1969; Chakraborty 1981; Lynch & Walsh 1998; Hansson & Westerberg 2002). Identity disequilibrium may arise for several reasons, for example, partial inbreeding (matings among kin or self-fertilization), admixture and genetic drift.
Väli et al. (2008) suggested that single nucleotide polymorphisms (SNPs) and resequencing approaches are superior to microsatellites (with their unique mutation patterns and potential ascertainment bias) in predicting genetic diversity in the context of conservation genetics. In this commentary, we would like to clarify that this is not necessarily true, since correlations between multilocus heterozygosity and genome-wide variability are not expected at identity equilibrium, regardless of marker choice. Moreover, in the presence of identity disequilibrium, previous studies have shown that heterozygosity at more variable markers, such as microsatellites, will provide a much stronger prediction of genome-wide genetic diversity than heterozygosity at markers with little variation, such as SNPs (Slate et al. 2004; see also Csilléry et al. 2006). We would also like to highlight that conservation genetic studies using SNPs will require a substantial effort in terms of the number of genotyped loci.
Results and discussion
To clarify and to stress the importance of identity disequilibrium and marker variability for correlations between multilocus heterozygosity (MLH) and one measure of genetic variability, genome-wide heterozygosity (GWH), we performed a simple simulation in which heterozygosity at two types of markers, corresponding to microsatellites and SNPs, respectively, was tested for correlation with GWH in populations with different levels of identity disequilibrium (Box 1). By simulating different levels of inbreeding we generated three populations (A–C) that differed substantially in the degree of identity disequilibrium and GWH. These ranged from a largely outbred population [population A; average (range) inbreeding coefficient, f = 0.012 (0–0.13)] to populations harbouring both highly inbred (genome-wide homozygous) and outbred (genome-wide heterozygous) individuals [population B: f = 0.095 (0–0.32); population C: f = 0.16 (0–0.74); Box 1]. Thus, our three simulated populations differed in the degree of variation in the inbreeding coefficient [σ2(f) = 7.0 × 10−5, 6.2 × 10−3, 2.7 × 10−2, respectively] and genome-wide heterozygosity [σ2(GWH) = 1.9 × 10−3, 2.3 × 10−3 and 4.1 × 10−3, respectively]; consequently, they differed in the degree of identity disequilibrium.
The importance of identity disequilibrium for generating correlations between MLH and GWH was apparent for both microsatellites and SNPs (Box 1). Population A, with weak identity disequilibrium, exhibited no correlation between MLH and GWH (r2 < 0.01; Box 1), whereas the populations with stronger identity disequilibrium showed moderate to strong correlations between MLH and GWH (population B: r2 = 0.01–0.19; population C: r2 = 0.31–0.54; Box 1). Consequently, only when there is substantial identity disequilibrium will MLH at different sets of loci correlate strongly with genetic variability. In line with this reasoning, Väli et al. (2008) found a clear correlation between microsatellite MLH and nucleotide diversity (r2 = 0.70) when all eight mammalian populations were analysed simultaneously (see fig. 1a, b in Väli et al. 2008). These populations show striking differences in their population history (Väli et al. 2008), which leads to substantial identity disequilibrium in the merged sample. Strong correlations have also been found in other between-population studies, e.g. in Atlantic salmon (Salmo salar) with r2 = 0.42 for microsatellite and SNP heterozygosity (Ryynänen et al. 2007). Furthermore, in Väli et al.’s study, the strongest within-population correlation between microsatellite and SNP heterozygosity was observed in the Scandinavian wolf population (see fig. 2a in Väli et al. 2008). This wolf population is characterized by frequent incestuous inbreeding and occasional outbreeding due to immigration (Vila et al. 2003; Bensch et al. 2006) and, as a consequence, by pronounced partial inbreeding and identity disequilibrium (Liberg et al. 2005; Bensch et al. 2006). We do not know whether the other populations (showing no correlation between microsatellite and SNP heterozygosity) in Väli et al. (2008) are close to identity equilibrium, but if they are, no or very weak correlations between marker heterozygosity and genome-wide variability are expected (see population A; Box 1).
There was a pronounced impact of marker variability in our simulated data sets: the microsatellite-based MLH provided a much stronger prediction of GWH than did SNP-based MLH (Box 1). For example, in population C, the population with the highest variation in inbreeding, the proportion of variance in GWH explained by microsatellite MLH was 0.54, whereas the corresponding value for the SNPs was 0.31 (Box 1). Similar results have been found in previous simulation studies (Slate et al. 2004). Therefore, to study one aspect of the genetic variation, namely the genome-wide variation, these results would suggest the use of highly variable markers like microsatellites, unless a substantially higher number of SNPs can be screened to compensate for their low variability. Following Slate et al. (2004; equation 4), we estimated that approximately five times more SNPs (with 38% heterozygosity) than microsatellites (with 78% heterozygosity) would be required to achieve similar correlations between MLH and f, where f is used as a proxy of GWH, even in the most genetically variable of our populations (Box 2). The high number of SNPs necessary to achieve sufficient power certainly needs to be taken into account when designing conservation genetic studies. However, recent molecular techniques and bioinformatics tools make SNPs increasingly accessible and screenable compared to just a few years ago (e.g. Ryynänen et al. 2007; Kenta et al. 2008; Stapley et al. 2008; Slate et al. 2009), which may compensate for the need to genotype larger numbers of SNPs.
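The dependence of the MLH–f correlation on marker heterozygosity and marker number can be illustrated with a short derivation. This is a simplified sketch and not a transcription of equation 4 in Slate et al. (2004). It assumes that all L markers segregate independently, share the same expected heterozygosity h in outbred individuals, and are heterozygous with probability h(1 − f) in an individual with inbreeding coefficient f. The symbols h, L and the mean inbreeding coefficient \bar{f} are notation introduced here for illustration.

```latex
\begin{align}
\operatorname{Cov}(\mathrm{MLH}, f) &= -h\,\sigma^2(f), \\
\operatorname{Var}(\mathrm{MLH}) &= h^2\sigma^2(f)
  + \frac{h}{L}\Bigl\{(1-\bar{f}) - h\bigl[(1-\bar{f})^2 + \sigma^2(f)\bigr]\Bigr\}, \\
r^2(\mathrm{MLH}, f) &= \frac{\operatorname{Cov}(\mathrm{MLH}, f)^2}{\operatorname{Var}(\mathrm{MLH})\,\sigma^2(f)}
  = \frac{h\,\sigma^2(f)}
         {h\,\sigma^2(f) + \dfrac{1}{L}\Bigl\{(1-\bar{f}) - h\bigl[(1-\bar{f})^2 + \sigma^2(f)\bigr]\Bigr\}}.
\end{align}
```

Under these assumptions, r2 approaches zero as σ2(f) approaches zero whatever the marker type, and for a given σ2(f) it increases with both h and L, which is the qualitative pattern described above.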
Finally, we would like to point out that aspects of genetic variation other than genome-wide variability may be better studied at the sequence level than with highly variable markers such as microsatellites. It has been suggested that the genetic variation at specific, critical genes, such as the major histocompatibility complex, should be the focus in conservation genetic studies (Hughes 1991; Miller & Lambert 2004). Resequencing approaches of the target genes per se, using techniques described by Väli et al. (2008), would be preferable as they will provide direct data on the amount of genetic variation for these potentially important genes and would also make it possible to evaluate past and ongoing selection at the molecular level by using various sequence-based genomic approaches (Nielsen 2005; Nielsen et al. 2005). The amount of genetic variation at specific genes can also be studied through correlated patterns at linked markers, caused by hitch-hiking processes. This is done with outlier approaches, where a large number of randomly selected markers are typed and then evaluated for deviations from neutral expectations to find spots of importance for selection and adaptation (Hemmer-Hansen et al. 2007; Wood et al. 2008). In addition, in genetic model organisms for which genetic maps are available it is also possible to study the degree of genetic variation and other signatures of selection on a local chromosomal scale by genotyping linked markers, e.g. by selecting markers that are tightly linked to genes of particular interest (Sutter et al. 2007). It is likely that SNPs will in most circumstances provide a better prediction of the amount of genetic variation at tightly linked genes than will microsatellites, because the high mutation rates of microsatellites may lead to allelic heterogeneities and decay of linkage disequilibrium to critical loci (Hästbacka et al. 1992; Kruglyak 1999; Slatkin 2008). An exception could be when selection has been very recent. Then, variable markers, such as microsatellites, may provide a superior signal of linked genetic diversity compared to SNPs.
Rapid advances in molecular biology and bioinformatics in recent years make it possible to carry out large-scale genotyping of markers and functional genes in large population data sets (Ellegren & Sheldon 2008). This is good news, since the structure and demographic history of populations can be studied with increasing precision, potentially by using a combination of microsatellite, SNP and other types of genetic data, which may provide crucial knowledge to increase our understanding of the role and importance of genetic variation in conservation biology.
Box 1. The correlation between MLH and GWH in populations with different degrees of variation in GWH using microsatellites and SNPs
Individual-based population simulations were conducted in MATLAB 7 (MathWorks, R2009a). Individuals in the founder population were created by randomly drawing alleles from specified allele frequency distributions. Populations were run for 50 generations and different population scenarios were implemented to create populations with the same number of individuals (500), but with different variance in the inbreeding coefficient (f) and GWH [population A: σ2(f) = 7.0 × 10−5, σ2(GWH) = 1.8 × 10−3; population B: σ2(f) = 6.2 × 10−3, σ2(GWH) = 2.3 × 10−3; population C: σ2(f) = 2.7 × 10−2, σ2(GWH) = 4.0 × 10−3]. A total of 200 independently segregating loci were modelled, of which 100 bi-allelic SNPs represented genome-wide loci and 50 bi-allelic SNPs and 50 penta-allelic microsatellites represented markers. In outbred individuals (f = 0), the genome-wide SNPs were selected to have an expected mean heterozygosity of 32%, the marker SNPs a mean of 38% and the microsatellite markers a mean of 78%. At the end of the simulation, each individual was scored for GWH and for SNP and microsatellite heterozygosity. From the pedigree information of each population we calculated each individual’s f. The main results are that the correlation between MLH and GWH increases with increasing variation in f and GWH and with the variability of the markers. These patterns remained when different numbers of loci were used to calculate GWH (50 and 500, respectively; results not shown).
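The qualitative pattern described in this box can be illustrated with a much simpler sketch than the pedigree-based simulation used here. The Python code below is not the MATLAB implementation. It assumes that each individual has a given inbreeding coefficient f and that every locus is heterozygous independently with probability h(1 − f), where h is the expected heterozygosity in outbred individuals (32%, 38% and 78% as above). Alleles, pedigrees and generations are not modelled, and the function names and the example distributions of f are hypothetical.

```python
import random


def pearson_r2(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # correlation is undefined without variation
    return sxy * sxy / (sxx * syy)


def mean_heterozygosity(f, h, n_loci, rng):
    """Proportion of heterozygous loci in one individual.

    Assumption: loci are independent and each is heterozygous with
    probability h * (1 - f).
    """
    p_het = h * (1.0 - f)
    return sum(rng.random() < p_het for _ in range(n_loci)) / n_loci


def simulate_r2(f_values, seed=1):
    """r2 between marker MLH and GWH for individuals with the given f values."""
    rng = random.Random(seed)
    gwh, snp, msat = [], [], []
    for f in f_values:
        gwh.append(mean_heterozygosity(f, 0.32, 100, rng))   # genome-wide loci
        snp.append(mean_heterozygosity(f, 0.38, 50, rng))    # SNP markers
        msat.append(mean_heterozygosity(f, 0.78, 50, rng))   # microsatellite markers
    return {
        "snp": pearson_r2(snp, gwh),
        "microsatellite": pearson_r2(msat, gwh),
    }


# Hypothetical populations of 500 individuals: one fully outbred, one with
# a quarter of the individuals highly inbred (f = 0.5).
outbred = [0.0] * 500
mixed = [0.0] * 375 + [0.5] * 125
print("outbred:", simulate_r2(outbred))
print("mixed:  ", simulate_r2(mixed))
```

In this sketch the population without variation in f is expected to show essentially no correlation for either marker type, whereas the mixed population should show a clear correlation that is stronger for the more variable markers.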
Box 2. The association between the proportion of variance in the inbreeding coefficient (f) explained by mean MLH and the number of markers used to calculate MLH for SNPs and microsatellites in three different population scenarios
We used equation 4 in Slate et al. (2004) to estimate the proportion of variance in f explained by (i) SNP markers with a heterozygosity level of 38% (dashed lines); and (ii) microsatellite markers with 78% heterozygosity (solid lines), respectively, for the three populations [A (triangles), B (squares) and C (diamonds)] modelled in Box 1. The main result is that the proportion of variance in f explained by MLH increases with increasing number of markers, increasing marker heterozygosity and increasing genetic variation in the population. To achieve, for instance, an r2-value of 0.5 between f and MLH in these populations, approximately 4.4–5.6 times more SNPs than microsatellites would be required.
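The kind of calculation summarized in this box can be sketched in a few lines of Python. The code below is not a transcription of equation 4 in Slate et al. (2004). It uses a simplified approximation that assumes L independent markers with equal expected heterozygosity h, each heterozygous with probability h(1 − f), which gives r2 = hσ2(f) / {hσ2(f) + [(1 − mean f) − h((1 − mean f)2 + σ2(f))] / L}. The function names are hypothetical; the means and variances of f are those reported for populations A–C.

```python
def expected_r2(h, n_loci, mean_f, var_f):
    """Approximate expected r2 between MLH and f.

    Assumption: n_loci independent markers, each heterozygous with
    probability h * (1 - f) in an individual with inbreeding coefficient f.
    """
    signal = h * var_f
    noise = (1.0 - mean_f) - h * ((1.0 - mean_f) ** 2 + var_f)
    return signal / (signal + noise / n_loci)


def loci_needed(h, target_r2, mean_f, var_f):
    """Number of markers at which expected_r2 reaches target_r2."""
    noise = (1.0 - mean_f) - h * ((1.0 - mean_f) ** 2 + var_f)
    return target_r2 * noise / ((1.0 - target_r2) * h * var_f)


# Mean and variance of f for the three simulated populations (from the text).
populations = {
    "A": (0.012, 7.0e-5),
    "B": (0.095, 6.2e-3),
    "C": (0.16, 2.7e-2),
}

for name, (mean_f, var_f) in populations.items():
    n_snp = loci_needed(0.38, 0.5, mean_f, var_f)
    n_msat = loci_needed(0.78, 0.5, mean_f, var_f)
    print(f"population {name}: {n_snp:.0f} SNPs or {n_msat:.0f} microsatellites "
          f"for r2 = 0.5 (ratio {n_snp / n_msat:.1f})")
```

With these inputs the approximation gives SNP-to-microsatellite ratios of roughly four to six, in the same general range as the values quoted above. Small differences from the published figures are to be expected given the rounding of the inputs and the simplifying assumptions.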