Identification of high-resolution microsatellites without a priori knowledge of genotypes using a simple scoring approach

Summary

  1. Large numbers of microsatellite loci for many species are increasingly made available from public databases or through de novo isolation with next-generation sequencing methods. However, the challenge remains to identify the subset of loci with suitable polymorphism for the populations of interest. This normally requires genotyping individuals in a representative subset of populations.
  2. Here, we present an approach that does not require individual genotypes, but instead uses the presence and absence of alleles in pooled DNA samples to estimate within-population variation (WPV, i.e. polymorphism) and among-population variability (APV, i.e. rare alleles). From these, we developed a scoring procedure to rank individual loci and thereby identify those most suitable for individual genotyping.
  3. We used multiplex PCR of 20 pooled individuals from each of 12 populations to screen a total of 80 published and unpublished microsatellite loci in the Northern pike Esox lucius, a species known to exhibit low genetic variability, thus predisposing it as a model species to test our approach. The screening of pooled samples revealed extremely different levels of microsatellite variability, ranging from 1·0 to 27·5 alleles per microsatellite (WPV) and from 1·4 to 12 populations per allele (APV). Scoring placed microsatellites with high WPV and medium APV at the highest ranks.
  4. We validated this outcome by genotyping individuals of two pike populations from geographically close areas. A strong differentiation potential (measured as Weir & Cockerham's FST) as well as a better resolving power of higher-ranked over lower-ranked microsatellites (measured as Nei′s DA and probability of identity) proved that the method presented here is suitable for rapid and inexpensive identification of appropriate microsatellites without a priori knowledge of genotypes.
  5. Our scoring procedure is generally applicable to all population genetic studies based on microsatellites and is particularly recommended if species with low genetic variability are to be investigated.

Introduction

Analysis of population genetic structures due to demographic processes requires selectively neutral genetic markers, and microsatellites are widely applied for this purpose. These single-locus markers are primarily co-dominant and independently inherited and are highly informative due to their multi-allelic nature (Selkoe & Toonen 2006). In the absence of gene flow, new alleles that arise in a population by mutation remain unique to that population. These private alleles are particularly useful in the genetic analysis of populations because of their high diagnostic potential.

Since microsatellites are present in the genome of most taxa (Neff & Gross 2001), researchers can easily identify and isolate almost any number of these markers by applying next generation sequencing (NGS) methods and specialized identification software (Gardner et al. 2011). Alternatively, published microsatellites may be used, which have the added advantage that they were already tested with the target species, thus circumventing problems that might be encountered with de novo isolation via NGS in some taxonomic groups (Schoebel et al. 2013).

Whatever approach is favoured, in many cases it remains a critical issue to select those microsatellites that possess appropriate genetic resolution for the populations under study. For example, NGS may provide many potential microsatellites, but only indirect information, inferred from repeat type and length, concerning their potential variability in populations. Moreover, certain microsatellites may be suitable for populations in one geographic area but not in another, due to pronounced genetic drift and mutation. Where gene flow (natural or human-assisted) has erased a substantial portion of the genetic differences among populations in a limited geographic area, high-resolving microsatellites are required for successful analysis (Ryman, Utter & Laikre 1995). Such powerful microsatellites will also be very valuable in identifying associated genetic units in landscapes that have become more and more fragmented due to human activities (examples given in: Reddy et al. 2012; Quéméré et al. 2010; Veit et al. 2005). For economic reasons, a systematic approach should be favoured over a trial-and-error method, especially if many populations have to be analysed with a sufficient number of microsatellites. At the same time, such a procedure turns a rather subjective process into a more standardized one, allowing a more flexible and projectable analysis.

Polymorphic microsatellites are generally preferred because they provide many alleles to characterize a population. At the same time, these alleles have to differ enough between populations to allow efficient differentiation (e.g. diagnostic alleles). Finally, microsatellites must have good analytical quality to yield reproducible and reliable genotypic data. The goal of the present study, therefore, was to develop a simple and generally applicable scoring system that considers these requirements in order to select the most suitable microsatellites for a given species without a priori knowledge of genotypes.

As a test species, we used the Northern pike Esox lucius. Populations of this fish generally exhibit extremely low levels of variation, a fact that makes studies of genetic relationships especially on small geographic scales still challenging (Senanan & Kapuscinski 2000; Hansen, Taggart & Meldrup 1999). We used pike samples from different river catchments of Germany (listed in Table 1) and performed genetic analysis with 80 microsatellites that were already available (details summarized in Table S1) and that were known to have different resolving power in different populations and geographic areas (Laikre et al. 2005; Miller & Kapuscinski 1996).

Table 1. Source water bodies of examined pike test populations
No.  Name                Type      Catchment
1    Eider               River     Eider
2    Havel               River     Elbe
3    Oder                River     Oder
4    Main                River     Rhine
5    Ucker               River     Ucker
6    Bodden              Brackish  Baltic Sea
7    Danube              River     Danube
8    Kleiner Döllnsee    Lake      Elbe
9    Ammersee            Lake      Danube
10   Großer Plöner See   Lake      Schwentine
11   Müritz              Lake      Elbe
12   Lake Constance      Lake      Rhine

Our scoring approach possesses two key components. First, DNA extractions were pooled in order to keep the analytical effort – polymerase chain reactions and fragment analysis – for the testing of many microsatellites at an acceptable level. Pools have been used earlier by Thomas et al. (2007) to detect signatures of selective sweeps in genome scans employing microsatellites and by Cryer, Butler & Wilkinson (2005) to identify polymorphic microsatellites for multiplex PCR. Both groups compared DNA profiles by eye, with Thomas et al. explicitly reporting that automatic scoring procedures failed. In addition to their work, we now provide a method that allows computation of profiles. As a second key feature of our approach, we used the presence and absence of alleles as simple computable criteria measuring variation within and among populations to allow an objective ranking of microsatellites.

The scoring results, obtained with pooled DNA, were evaluated with respect to the ranking of microsatellites with polymorphic and diagnostic alleles as well as high analytical quality. Finally, to validate that higher-ranked microsatellites possess a higher genetically resolving power than lower-ranked microsatellites, we compared genetic distances and probabilities of identity calculated from genotypic data obtained from individual DNA of two spatially close pike populations with different subsets of microsatellites.

Materials and methods

DNA extraction

Pike tissue samples were provided by commercial fishermen, fisheries authorities and research organizations. Frozen tissues were transferred to absolute ethanol and stored at room temperature as suggested in Eschbach (2012). For subsequent analysis, tissue samples of 20 pike individuals of each of 12 different populations from rivers and lakes in Germany (Table 1) were used. DNA was extracted from fin clips with a reverse solid phase extraction method in 96-well plates (nexttec™ DNA isolation system, Biozym, Hessisch Oldendorf, Germany) according to the manufacturer's instruction. Equal volumes of DNA extracts of 20 individuals per population were pooled without prior normalization of DNA concentrations.

PCR and fragment analysis

A total of 80 pike microsatellites were extracted from GenBank (NCBI), primary literature and through personal communication with other research organizations. Seventy-three were originally isolated from Esox lucius and seven from E. masquinongy. Further details including repeat type, primer sequences, accession numbers and references are compiled in Table S1. To determine the temperature range in which the microsatellites were specifically amplified, gradient PCRs were performed with DNA of a single pike individual. The Qiagen® Multiplex PCR Kit (Qiagen, Hilden, Germany) was used according to the ‘universal multiplex cycling protocol’ in the manufacturer's instruction employing the following reaction conditions: 95 °C for 15 min, 35 cycles of 94 °C for 30 s, temperature gradient of 55–65 °C for 90 s, 72 °C for 90 s, followed by a final step of 72 °C for 10 min. The kit consisted of a twofold concentrated master mix, to which 1 μl of DNA extract and unlabelled primers (listed in Table S1) in a final concentration of 0·2 μM were added. PCRs were performed with a Thermocycler T Gradient machine (Biometra, Goettingen, Germany) in a reduced reaction volume of 10 μl. For subsequent fragment analysis, PCRs were performed exactly as described above using 1 μl of pooled DNA extract, forward primers, 5′-labelled with the fluorescent dyes NED, 6-FAM or HEX and unlabelled reverse primers. Annealing temperatures applied were according to the results of the gradient PCR and are specified in Table S2. After PCR, 0·5 μl reaction solution was mixed with 9·25 μl HiDi™ formamide and 0·25 μl GeneScan™500 ROX™ Size Standard (Life Technologies, Darmstadt, Germany) and heat-denatured in 96-well plates. Electrophoresis was performed with an Applied Biosystems 3500xL Sequencer equipped with a 24-capillary array. 
Chromatograms were evaluated with GeneMapper® Software v4.1 (Life Technologies, Darmstadt, Germany) by identifying all microsatellite alleles and their respective sizes in each of the twelve populations. A peak was accepted as an allele if it showed a fluorescence intensity value of at least 500. Peaks that were clearly identified as stutters by their appearance and position were omitted.

Scoring procedure

All detected alleles were written into a 1-0-matrix, in which ‘1’ indicates presence and ‘0’ absence of an allele in a population (Fig. 1). Within-population variation (WPV) was calculated by counting all alleles of a microsatellite in a population. Afterwards, the mean for all twelve populations was calculated. Among-population variation (APV) was calculated by counting the number of populations in which a given allele was found. To account for the higher diagnostic potential of rare alleles, the APV value was inverted, that is, an allele occurring in only one of the twelve populations received the highest value of 12. Afterwards, the mean of the inverted values for all alleles was calculated. The quality index (QI) of the obtained chromatograms was determined by judging the peak quality (2 = good: clearly shaped, no stutter, 1 = normal: minimal stutter, 0 = poor: could not be evaluated) and the success of allele amplification from pooled DNA extractions of twelve populations (1·0 = alleles amplified in all 12 populations, 0·75 = no alleles amplified in one population, 0·5 = no alleles amplified in two populations, 0 = no alleles amplified in three or more populations). The product of both values represented the QI. The final score was calculated by multiplying the means of WPV, invAPV and QI. To evaluate the scoring results with pooled DNA, an incremental evaluation was performed by determining the percentage of microsatellites of high, medium and low variability as well as their average QI in each increment of 10 consecutively ranked microsatellites (Table S3 and Fig. 2; increment 7 was the only exception, because it contains the results of 12 microsatellites). Within-population variation >10 was considered high, 5–10 medium and <5 low. Among-population variation was considered high with >8, medium with 4–8 and low with <4 populations/allele.
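The scoring steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names and the toy matrix are hypothetical, and the inversion of APV is assumed to be linear (number of populations + 1 − number of populations carrying the allele). The text states only that an allele found in a single one of the twelve populations receives the value 12, which this assumption reproduces.

```python
from statistics import mean


def amplification_success(n_failed_pops):
    """Amplification weight as given in the text: 1.0 if alleles amplified in
    all populations, 0.75 if one population failed, 0.5 if two failed, else 0."""
    return {0: 1.0, 1: 0.75, 2: 0.5}.get(n_failed_pops, 0.0)


def score_locus(matrix, peak_quality, amp_success):
    """Score one microsatellite from its presence/absence matrix.

    matrix: dict mapping allele label -> list of 0/1 flags, one per population.
    peak_quality: 2 (good), 1 (normal) or 0 (poor), as defined in the text.
    Returns (mean WPV, mean invAPV, QI, score).
    """
    rows = list(matrix.values())
    n_pops = len(rows[0])
    # WPV: number of alleles present in each population, averaged over populations.
    wpv = mean(sum(row[p] for row in rows) for p in range(n_pops))
    # APV: number of populations in which each allele occurs.
    apv = [sum(row) for row in rows]
    # ASSUMPTION: linear inversion, so an allele in 1 of 12 populations gets 12.
    inv_apv = mean(n_pops + 1 - k for k in apv)
    qi = peak_quality * amp_success
    return wpv, inv_apv, qi, wpv * inv_apv * qi


# Hypothetical toy locus: three alleles screened in four pooled populations.
toy = {"a": [1, 1, 1, 1], "b": [1, 0, 0, 0], "c": [1, 1, 0, 0]}
wpv, inv_apv, qi, score = score_locus(toy, 2, amplification_success(0))
```

Because the final score is a plain product, a locus with QI = 0 always scores zero regardless of its variability, which matches the lower bound of the score range reported in the Results.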

Figure 1.

Example chromatogram (upper part) and scoring matrix (lower part) of microsatellite Elu87 and population 4 from river Main. Presence and absence of alleles is indicated in the matrix by 1 and 0, respectively. Within population variation (WPV) is calculated per population and as the mean of all populations (indicated part of the second last row). Among population variation (APV) is calculated per allele (third last column), inverted (invAPV) and the mean calculated over all alleles (indicated part of the second last column). The quality index (QI) is calculated as the product (last line in last column) of peak quality and success of amplification (lines 2 and 3 of the last column, respectively). The score was calculated as the product of the means of WPV, invAPV and QI (last row).

Figure 2.

Incremental evaluation of the scoring results. Each increment (x-axis) includes the results of 10 consecutively scored microsatellites, except increment 7 containing the results of 12 microsatellites (data extracted from Table S3). Black, grey and white columns indicate the respective proportion (left y-axis) of high, medium and low variable microsatellites (= ms type) within populations in (a) and among populations in (b). The curves in (a) and (b) are identical and indicate the average quality of microsatellites per increment (right y-axis).

Validation

To test the results obtained by the scoring procedure with pooled DNA, DNA from individual pike of two populations was genotyped with 20 high-scored microsatellites. DNA was extracted from 21 individuals produced with parental pike from lake Müritz (Mecklenburg-Vorpommern, Germany; obtained from fishery Müritz-Plau GmbH) and 23 individuals derived from parental pike from lakes Bordesholmer See and Bothkamper See (Schleswig-Holstein, Germany; obtained from fishery Ostercappeln). Four multiplex PCRs were conducted as described above (see ‘PCR and fragment analysis’), using 58 °C as annealing temperature for all reactions. Labelled and unlabelled primers were supplied for five microsatellites in each multiplex, as indicated in Table S4. Presence of null alleles and large allele drop-out was tested with Micro-Checker 2·2·3. Differentiation coefficient FST (Weir & Cockerham 1984) and genetic distance DA (Nei, Tajima & Tateno 1983) were calculated with Microsatellite Analyser (MSA 4·0·5). Probability of identity (PI), which describes the probability that two individuals share the same genotype at a given locus, was calculated over all individuals of the two populations with Cervus 3·0 (Kalinowski, Taper & Marshall 2007). DA and PI were determined with different subsets of 10 out of the 20 microsatellites used for genotyping, with the first subset consisting of the 10 highest-ranked microsatellites, ranging from rank one to 10. The second subset comprised the next ten microsatellites starting at rank two, thus ranging from rank two to 11, and so on, until the 10 lowest-ranked microsatellites were grouped in subset 11 (see Table S4 for ranking information of the 20 microsatellites used).
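The construction of the overlapping subsets can be illustrated with a sliding window over the ranked list. The sketch below is hypothetical (the function name and the use of positions 1 to 20 in place of locus names are assumptions for illustration); it shows only how 20 ranked loci yield 11 subsets of 10, not how DA or PI are computed.

```python
def sliding_subsets(ranked_loci, size=10):
    """Return all windows of `size` consecutive loci from a list that is
    ordered from highest to lowest rank."""
    return [ranked_loci[i:i + size] for i in range(len(ranked_loci) - size + 1)]


# Positions 1..20 stand in for the 20 genotyped loci, ordered by rank.
loci = list(range(1, 21))
subsets = sliding_subsets(loci)

# Subset 1 holds positions 1-10, subset 2 positions 2-11, subset 11 positions 11-20.
for number, subset in enumerate(subsets, start=1):
    print(number, subset[0], subset[-1])
```

Neighbouring subsets share nine of their ten loci, so values calculated for consecutive subsets are not independent of each other.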

The whole procedure of microsatellite selection is additionally summarized in a flowchart (Figure S1), which may serve as a guideline.

Results

Fragment analysis

After pre-testing of 80 microsatellites with gradient PCR, eight loci yielded no or unspecific PCR products over the whole temperature gradient applied and were excluded from further analysis. With 72 microsatellites, fragment analysis was performed with DNA pools extracted from 20 individuals of each of twelve pike populations from different German water bodies listed in Table 1. Experimentally determined allelic size ranges and the total numbers of alleles were mostly different from values reported in the literature (details summarized in Table S2).

Within-population variation was 8·9 ± 5·4 alleles per locus (mean ± standard deviation; CV = 60·2%) averaged over all 72 microsatellites and all 12 pike populations (range 1·0–27·5 alleles) (Table S3). Among-population variation was 7·5 ± 2·1 populations per allele (CV = 27·6%) averaged over all alleles of the 72 microsatellites (range 1·4–12 populations). The mean QI of all microsatellites was 1·5 ± 0·8. The score was calculated from the above-mentioned three components (using the inverted values of APV: see methods for details) and ranged from zero to 238.

Scoring

Based on the ranking results (Table S3), we determined the respective percentages of microsatellites of high, medium and low variability per increment of 10 consecutively ranked microsatellites (see methods for details). This was carried out with respect to WPV and APV and is depicted in Fig. 2a,b, respectively, together with the quality measure (QI). Looking at WPV (Fig. 2a), the scoring procedure ranked highly variable microsatellites in the highest positions (= increments 1 and 2), followed by microsatellites of medium and finally of low variability. With respect to APV (Fig. 2b), microsatellites of medium variability were ranked highest (= increments 1–3). Thereafter, highly variable microsatellites dominated (= increments 4 and 5), and microsatellites of low variability were ranked lowest (= increments 6 and 7). Microsatellites that exhibited the best analytical quality were ranked highest (QI = 2·0 in increment 1), while those with poorer quality were ranked lower.
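The incremental evaluation can be sketched as follows, using the class boundaries given in the methods (WPV: >10 high, 5–10 medium, <5 low; APV: >8 high, 4–8 medium, <4 low populations per allele). The function names are hypothetical, and folding the remainder into the last increment is an assumption that reproduces the reported layout of six increments of 10 and a seventh of 12 for 72 loci.

```python
def classify(value, low_limit, high_limit):
    """Return 'high' above high_limit, 'low' below low_limit, else 'medium'."""
    if value > high_limit:
        return "high"
    if value < low_limit:
        return "low"
    return "medium"


def classify_wpv(wpv):
    return classify(wpv, 5, 10)   # alleles per microsatellite


def classify_apv(apv):
    return classify(apv, 4, 8)    # populations per allele


def increments(ranked, size=10):
    """Split a ranked list into consecutive increments of `size`;
    ASSUMPTION: leftover loci are appended to the last full increment."""
    n_full = len(ranked) // size
    chunks = [ranked[i * size:(i + 1) * size] for i in range(n_full)]
    rest = ranked[n_full * size:]
    if chunks and rest:
        chunks[-1] = chunks[-1] + rest
    elif rest:
        chunks = [rest]
    return chunks


def class_percentages(values, classifier):
    """Percentage of high, medium and low loci within one increment."""
    labels = [classifier(v) for v in values]
    return {c: 100 * labels.count(c) / len(labels) for c in ("high", "medium", "low")}


sizes = [len(chunk) for chunk in increments(list(range(72)))]
```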

Validation

Selection of the microsatellites was primarily performed with respect to their ranking position. But, since we also had to take PCR and fragment analysis specific requirements into account (i.e. annealing temperature and sizes of PCR products), the 20 microsatellites selected for multiplex PCR finally came from ranks one to 32 (see Table S4). These were combined in four PCRs amplifying five loci each. All microsatellites yielded clearly distinguishable allele peaks with DNA from pike individuals of the two test populations (see methods for details of the populations). No large allele drop-out and only one possible null allele were detected. Total number of alleles across all loci was 116 and 170, respectively, revealing at least 54 alleles that were exclusive for one or the other population (i.e. diagnostic alleles). Mean numbers of alleles per locus were 5·8 ± 2·5 and 8·5 ± 3·8, respectively, indicating different degrees of polymorphism in the two populations (further detailed in Table S4). The FST value of 0·113 was highly significant for the two populations (P < 0·001). Genetic distances (Nei′s DA) calculated additionally with data from subsets of 10 microsatellites (see methods for details of these calculations) were highest with the first subset comprising the 10 highest-ranked microsatellites. The values decreased when calculated with the next four subsets and then remained at a more or less constant level when performed with successively lower-ranked subsets of microsatellites (Fig. 3). Applying the same subsets, we found that PI was generally low and exhibited a constant increase when calculated with successively lower-ranked subsets of microsatellites (Fig. 3).

Figure 3.

Genetic distances calculated as Nei′s DA (y-axis of upper graph) and probability of identity PI (y-axis of lower graph) with different subsets of 10 out of the 20 microsatellites (x-axes) used for genotyping. A lower subset number indicates a subset with higher-ranked microsatellites, i.e. subset one includes the 10 highest-ranked and subset 11 the 10 lowest-ranked microsatellites. R² gives the regression coefficients of polynomial and exponential regressions for DA and PI, respectively.

Discussion

The aim of this study was to develop a rapid method to identify suitable microsatellites for population genetic investigations when large numbers of loci are available but only a subset is desired for genotyping. Sufficient numbers of microsatellites may either be available from public databases and other sources (as in our case) or can easily be developed de novo (Schoebel et al. 2013; Gardner et al. 2011). As a test case for our study, we chose pike Esox lucius because of its low genetic variability, which makes population genetic analysis of this species still a challenging endeavour (Senanan & Kapuscinski 2000; Hansen, Taggart & Meldrup 1999). Although many microsatellites have already been developed for pike (Table S1), not all of these are equally suited for populations from different geographic areas (Laikre et al. 2005; Miller & Kapuscinski 1996). Our own experimental data with 80 microsatellites indicated large deviations in allele sizes and total numbers of alleles compared with published data (Table S2) as well as high coefficients of variation in within-population variation (WPV) and among-population variation (APV) of 12 German pike populations (Table S3). This demonstrated that it is indeed worthwhile to screen a given repertoire of microsatellites for those that are best suited for a population genetic study.

Most population genetic calculations are based on allele frequencies, which have also been used to rank genetic markers (e.g. with the software WHICHLOCI developed by Banks, Eicher & Olsen 2003). However, this requires a priori knowledge of individual genotypes. Here, a different approach was used: the DNA of populations was pooled, which does not allow allele frequencies to be calculated. As an alternative, the presence and absence of alleles in these populations were determined and evaluated in a matrix-based approach. The presence–absence selection of alleles can be assumed to be more stringent than selection based on differences in allele frequencies, because it focuses on the identification of the most powerful alleles to differentiate between populations, that is, diagnostic or private alleles. Other advantages of this approach were, first, that only 864 instead of 17,280 PCRs were necessary to test 72 microsatellites from 12 populations. This number is of course variable, depending on the individual research context, which might require different numbers of populations and microsatellites to be screened. It will, however, always be considerably lower than the number required for genotyping of individuals. Secondly, defined computable and quantifiable values were obtained, allowing objective ranking of microsatellites based on their polymorphic and diagnostic traits as well as their analytical quality (Table S3). Thirdly, and most importantly, a priori knowledge of genotypes is not required.
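The two PCR counts follow directly from the study design. The short sketch below reproduces them, assuming one reaction per locus and sample and 20 individuals per population as in the pools; the function name is hypothetical.

```python
def pcr_counts(n_loci, n_populations, n_individuals_per_pop):
    """Number of PCRs for pooled screening versus individual genotyping,
    ASSUMING one reaction per locus and per sample (pool or individual)."""
    pooled = n_loci * n_populations
    individual = pooled * n_individuals_per_pop
    return pooled, individual


pooled, individual = pcr_counts(72, 12, 20)
print(pooled, individual)  # 864 17280
```

Under this assumption the saving factor equals the pool size, so pooling 20 individuals reduces the number of reactions 20-fold whatever the number of loci and populations.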

Using DNA pools bears the potential risk of losing particularly large and rare alleles. Comparison of 450 genotypes of randomly chosen individuals, which were part of the pools, indeed revealed that approx. 4% of alleles were not detected in the pool profiles (data not shown). On the one hand, this clearly shows that there is room for improvement, which can be achieved, for example, by standardization of DNA concentrations prior to pooling (Thomas et al. 2007) or by pooling DNA of fewer individuals (Cryer, Butler & Wilkinson 2005). On the other hand, this means an increase in time and effort while most likely yielding a similar outcome, as all microsatellites are affected by the same issue. The subsequently discussed results obtained with our scoring approach, particularly with the validation part of the process, argue in favour of such an assumption.

Our findings with pooled DNA showed that microsatellites were scored in higher ranks if they were highly variable within and moderately variable among populations (Fig. 2). Although low variability among populations means an even higher diagnostic power and should therefore be preferred by the scoring procedure, those microsatellites were placed at the end of the ranking table because of their low analytical quality (QI = 1·4 to 0). This demonstrated that the scoring procedure worked effectively by concentrating at the high ranks those microsatellites that have a high genetic resolving potential, due to their polymorphic nature and diagnostic value, while at the same time warranting a high analytical quality. These results were confirmed by genotypic data obtained with 20 high-ranked microsatellites from individuals of two spatially close pike populations. A high total number of alleles and a high average number of alleles per locus showing medium to high polymorphism, as well as a high proportion of diagnostic alleles, indicated that the selected microsatellites should properly resolve the two test populations. This was proven by a highly significant differentiation coefficient (FST = 0·113; P = 0·0001). Finally, although not tested with all 72 microsatellites, the decrease in the genetic distance value DA (Nei, Tajima & Tateno 1983) and the increase in the probability of identity of genotypes when using subsets with lower-ranked microsatellites (Fig. 3) clearly indicated that the scoring system worked very effectively by placing microsatellites with higher resolving power at higher ranks of the scoring table.

In conclusion, the scoring system presented here allows for a quick selection of microsatellites based on clearly defined computable criteria, facilitating an objective ranking without a priori knowledge of genotypes. It is generally applicable to genetic studies based on microsatellites and is particularly recommended if high genetic resolution is required for populations that are known or expected to exhibit low genetic differentiation.

Acknowledgements

Particular thanks go to Michael Monaghan and his Popgen group for critically reviewing and discussing our manuscript. We would also like to thank Arne Nolte, Klaus Kohlmann and Robert Arlinghaus for helpful discussions on an earlier version of the manuscript. Christian Schomaker is especially acknowledged for his support in sampling pike. Finally, we would like to thank three anonymous reviewers for their critical comments, which greatly helped to improve the manuscript. Funding of the current work was granted by the German Ministry of Education and Research within the project Besatzfisch (www.besatz-fisch.de) in the Program for Social–Ecological Research (Grant No. 01UU0907).
