SSR‐seq: Genotyping of microsatellites using next‐generation sequencing reveals higher level of polymorphism as compared to traditional fragment size scoring

Abstract Microsatellites (or simple sequence repeats, SSR) are widely used markers in population genetics. Traditionally, genotyping was and still is carried out through recording fragment length. Now, next‐generation sequencing (NGS) makes it easy to obtain also sequence information for the loci of interest. This avoids misinterpretations that otherwise could arise due to size homoplasy. Here, an NGS strategy is described that allows to genotype hundreds of individuals at many custom‐designed SSR loci simultaneously, combining multiplex PCR, barcoding, and Illumina sequencing. We created three different datasets for which alleles were coded according to (a) length of the repetitive region, (b) total fragment length, and (c) sequence identity, in order to evaluate the eventual benefits from having sequence data at hand, not only fragment length data. For each dataset, genetic diversity statistics, as well as F ST and R ST values, were calculated. The number of alleles per locus, as well as observed and expected heterozygosity, was highest in the sequence identity dataset, because of single‐nucleotide polymorphisms and insertions/deletions in the flanking regions of the SSR motif. Size homoplasy was found to be very common, amounting to 44.7%–63.5% (mean over all loci) in the three study species. Thus, the information obtained by next‐generation sequencing offers a better resolution than the traditional way of SSR genotyping and allows for more accurate evolutionary interpretations.

(SNPs) or insertions/deletions (indel) polymorphisms in the nucleotide sequence of that fragment, either within the repetitive array or in the flanking regions (FR), remain undetected by length assessment alone. Moreover, indels in the flanking regions might be incorrectly confounded with size mutations of the SSR. Thus, using only length information, SSR alleles may appear identical in state (i.e., length/ size), but actually they are not necessarily identical by descent in case of convergent mutation(s) to the same size ("size homoplasy", Estoup, Jarne, & Cornuet, 2002) or variability only in sequence but not in size. Estoup et al. (2002) used the term "molecularly accessible size homoplasy" to refer to the fraction of homoplasy that can be resolved by sequencing, which is only a subset of the size homoplasy that actually occurs at microsatellite loci. Still, sequencing cannot resolve homoplasy that arises from the convergence of two alleles to the same sequence and length.
While using NGS data from different sequencing platforms for SSR marker development in non-model plant species is now a common practice (Weising, Wöhrmann, & Huettel, 2015), NGS is very rarely used for SSR scoring. In order to tackle the homoplasy problem and assess FR variation, some authors combined cloning or single-strand conformation polymorphism and sequencing (e.g., Germain-Aubrey et al., 2016;Lia, Bracco, Gottlieb, Poggio, & Confalonieri, 2007;Ortí, Pearse, & Avise, 1997;Šarhanová et al., 2017), but these methods are costly, time-consuming, and not easily applicable for polyploids.
There are first forays among animals (Bradbury et al., 2018;De Barba et al., 2017;Vartia et al., 2016), but comparisons between traditionally scored fragment length data and the information obtained from sequencing are still missing.
Mutations in the SSR region (predominantly changes in the number of repeats) and FR (indels and SNPs) evolve at different rates: the fast-evolving repetitive region shows a mutation rate of about 10 −6 to 10 −2 per locus per generation (Schlötterer, 2000), whereas base substitutions occur at a much slower rate (depending on the genome size of the organism; Lynch, 2010), for example, in Arabidopsis thaliana at a rate of 7 × 10 −9 mutations per nucleotide position per generation (Ossowski et al., 2010). The combined information from both regions can thus be used for the inference of evolutionary events at different timescales or at least indicate possible mutational saturation of the SSR region or its convergent evolution to the same size.
Here, an NGS strategy is described which allows to genotype hundreds of individuals at several, custom-designed SSR loci simultaneously, using multiplex PCR and barcoded primers to separate individual-specific Illumina sequence reads. Our objectives were (a) to generate nucleotide sequence data of several non-model plant species, for which prior genomic data did not exist, from both the SSR and the flanking regions, (b) to record the length of the repetitive region, as well as SNP and indel variation within the SSR and the FR, (3) to estimate the amount of molecularly accessible size homoplasy of each locus, and (4) to compare the degree of genetic variability between different datasets based on the number of repeat units, fragment length, and sequence identity.

| Study species
The method described here is based on three angiosperm species from southern South America: Donatia fascicularis (Stylidiaceae), Mulguraea tridens (Verbenaceae), and Oreobolus obtusangulus (Cyperaceae) (Table 1, Figure 1). In total, 859 individuals were genotyped at 58 nuclear SSR loci and statistically analyzed. For detailed information about two of the studied species (D. fascicularis and O. obtusangulus) and population sampling see results published elsewhere (Pfanzelt, Albach, & von Hagen, 2017;S. Pfanzelt, P. Šarhanová, D. C. Albach, & K. B. vonHagen, under review). The work is a part of a study that includes five further angiosperm species of a wide phylogenetic range, different ploidy levels, genome size, and reproductive system (Astelia pumila, Asteliaceae; Berberis empetrifolia, Berberidaceae; Chuquiraga aurea, Asteraceae; Guaiacum sanctum, Zygophyllaceae; Rubus ulmifolius agg., Rosaceae; in total, about 2,000 individuals were scored at 132 SSR and 3 chloroplast loci), although the data of these latter species are not included in the present study.

| SSR identification, primer design, and testing
Initial detection of SSR loci relied on assembled Illumina paired-end sequencing reads of cDNA transcripts, which in turn stemmed from RNA extracted from fresh plant or RNA-later (Qiagen) treated material using the RNeasy Mini Kit (Qiagen). RNA extraction followed a standard protocol and included subsequent DNA digestion and an RNA re-precipitation step. Libraries were prepared and sequenced on an Illumina HiSeq 2000 platform according to the manufacturer's instructions, using a TruSeq RNA Library Prep Kit v2 and 10% of the lane per library. De-novo assembly of RNA-Seq output data was done in Geneious 6.0.4 (Kearse et al., 2012) with medium sensitivity settings.
The resulting contigs from the de-novo assembly were screened for SSRs using Phobos 3.3.12 (as a plugin in Geneious; Mayer, 2010).
The predefined repeat unit length was 3-6 bases (to avoid frequent PCR stuttering, SSRs with dinucleotide repeats were not considered; Miller & Yuan, 1997). The minimum length of the microsatellite region was at least 21 bp. One of the primers from a given primer pair was always selected to be close to the SSR to ensure that during SSR analysis, a single NGS read contains the entire repetitive region, thus assuring correct merging of paired reads. The target length of the amplicon was up to 450 bp.
Ninety-six primer pairs per species were designed employing Primer3 (as a plugin in Geneious; Rozen & Skaletsky, 2000) with the following settings: product size 250-400 bp; primer size = 18-22-30 bp (min-optimal-max); melting temperature (Tm) = 68-70-72°C; GC content = 40%-50%-60%; maximum T m difference = 2°C; remaining settings as default. Of these originally 96 primer pairs, 68 successfully produced amplicons for D. fascicularis, 51 for M. tridens, and 61 for O. obtusangulus. Amplicons of four individuals per species (different species and all loci were pooled prior library preparation generating four pools) were sequenced on an Illumina MiSeq platform (2 × 250 bp paired-end, using MiSeq Reagent Kit v2 and 25% of a lane per library), following Meyer and Kircher (2010) and omitting fragmentation. Sequencing adapters were removed using cutadaPt (Martin, 2011;minimum length 150, quality 15). Data were checked for read pairs in readtrimmchecker (Beier, 2016). Assembly was done in clc assembly cell 4.2.0 (using the cl c_overl ap_reads command, minimum overlap of 30), and contigs were imported into Geneious. Based on intraspecific variability, up to 20 SSR loci (hereafter called target SSRs) for each species were selected. Note. Taxonomic and genome information on the three studied species with number (N) of individuals and loci used. Ploidy levels of D. fascicularis and O. obtusangulus were inferred from chromosome number (4×) and from number of alleles per locus (2×); ploidy level of M. tridens was estimated from number of alleles per locus and through flow cytometry, using genome size as a proxy.

| Barcoding of primers and multiplex PCR
a Individuals with missing data in more than 30% of the loci were excluded from statistical analyses.
allow for multiplexing during library construction, ten-nucleotide barcode sequences, specific for each SSR locus and each sample set of 96 individuals, were appended to the 5ʹ-ends of both prim-

ers (forward and reverse) of the target SSRs (Supporting Information
Appendix S1). In total, 2 (forward and reverse) × 20 (loci) × number of sets (1 or 4) primers per species have been ordered. This double-tagging allowed parallel sequencing of several conspecific samples through pooling after PCR (for a graphical description of the method see Figure 2). multiPlX 2.1 (Kaplinski, Andreson, Puurand, & Remm, 2005) was used to define primer groups within each of the sample sets in order to identify optimal primer compatibility and to avoid undesired primer pairing. The grouping was done for each species and barcoded primer set separately. The software was run with the default settings, and "Calculating scores" was set to "primer-primer any". Each multiplex group consisted of 2-5 loci (Supporting Information Appendix S2), as multiPlX 2.1 did not consider more loci to be appropriate for multiplexing due to the risk of primer dimerization.
Multiplex PCR reactions were tested on four individuals per species (amplicons of different species and all loci were pooled prior to library preparation, generating four pools) and sequenced on an Illumina MiSeq platform as described above, using 5% of a lane per library. Raw

| Illumina paired-end sequencing of SSR amplicons
The 96

| Analysis pipeline
The data analysis pipeline included quality control, read merging, demultiplexing, de-novo assembly, and the construction of reference alignments. These steps are described in detail in the following. trimGalore 0.3.7 was used for adapter clipping and Pear v0.9.5 (Zhang, Kobert, Flouri, & Stamatakis, 2014) for merging of pairedend reads (setting the p-value threshold for the statistical test to the strictest value, i.e., 0.001) and quality trimming (quality score threshold of 30). Demultiplexing was done using the Perl script fastx_barcode_splitter.pl from the FastX Toolkit. The respective barcode file contained the specific 10 bp tag and the first 10 bp of the primer sequence, so a total length of 20 bp had to be matched. Two mismatches/partial matches were allowed. Forward and reverse merged reads from the split output carrying the same tags were subsequently concatenated. De-novo assembly was done using caP3 (version date 12/21/07; Huang & Madan, 1999), with the overlap percent identity cutoff set to 99 and the maximum gap length in any overlap set to 2. The resulting contigs (specific for each individual and each locus, for all species) were imported into Geneious 6.0.4, and a multiple alignment (consensus alignment, with the threshold set to 90%) was done together with a reference sequence (original sequence from cDNA transcripts) of the respective locus and sample set barcode in order to check for mis-tagging. Contigs were visually checked, and if variability was still present within a contig (indicating that caP3 assembled two alleles into one), de-novo assembly was repeated in Geneious (setting the maximum mismatches per read to 1%).
Allele sequences (without tags and primer sequences) are deposited at NCBI GenBank (accession numbers MG322761-MG323307).

| Size homoplasy
The amount of size homoplasy was calculated as the ratio of the number of fragment length classes containing alleles with different sequences and the total number of fragment length classes. This was done for each species and locus separately.

| Ploidy level estimation
The using genome size as a proxy (data not provided).

| Statistical analyses of population genetic structure and diversity
Data analyses were based on three different datasets, for which alleles were coded according to (a) length of the repetitive region (SSRlength dataset), (b) total fragment length (fragment-length dataset), and (c) sequence identity (sequence-identity dataset).
sPaGedi 1.5a (Hardy & Vekemans, 2002) was used to calculate the total number of alleles N A , gene diversity H e (corrected for sample size, Nei, 1978), and observed heterozygosity H o , as well as global F-and R-statistics for each of the three datasets separately (see above). To estimate the effects of the infinite allele (IAM; Kimura & Crow, 1964) versus stepwise mutation (SMM; Ohta & Kimura, 1973) models, we compared F ST versus R ST of the SSR-length and fragmentlength datasets and tested whether observed R ST was significantly higher than its value after permutation. P-values were obtained after 10,000 permutations of allele sizes among alleles within loci to test the null hypothesis that stepwise mutations do not contribute to genetic differentiation (Hardy, Charbonnel, Fréville, & Heuertz, 2003).
Additionally, we scored the number of variable sites, that is, SNPs F I G U R E 3 Selected coverage graphs of Mulguraea tridens individual Mt-033d. Exemplarily shown are the four loci mt10760, mt17340, mt17642, and mt24277. Read coverage of contigs is scaled to 1. Contigs are numbered according to read coverage, that is, the contig with the most reads is numbered contig 1. Heterozygosity and allele dosage become apparent when comparing relative read numbers: Individual Mt-033d is heterozygous at loci mt10760 (allele dosage 3:1), mt17340 (2:1:1), and mt24277 (2:2) and homozygous at locus mt17642

| Output statistics
Total number of raw reads from the Illumina MiSeq run was 41,990,310 (containing all eight species). Of these, 97.7% could be unambiguously assigned to the respective libraries based on sequencing adapters. Raw read numbers per library averaged 213,594 ± 57,455 (mean ± SD; forward and reverse libraries yet unmerged). Pear successfully merged 99.6% of all read pairs passing quality control (mean over all 96 libraries), so that the total number of merged reads was 20,401,853 (i.e., 2.8% of the raw read output either did not pass quality control or lacked its respective mate).
With regard to the three species studied here, all loci could be recovered by demultiplexing, but the average number of reads per locus (within one species) differed among loci by up to three orders of magnitude (read coverage threshold ≥10). Five and seven loci of D. fascicularis and O. obtusangulus had low coverage (<10 reads per allele) in more than 10% of the individuals or contained putative null alleles and were excluded from all analyses. One locus of D. fascicularis and one of M. tridens contained highly divergent allele sequences, suggesting the existence of two or more paralogous copies. These loci were also excluded. Locus mt11151 of M. tridens contained two different repetitive regions and was separated into two loci in the SSR-length dataset. Individuals with missing data in more than 30% of the loci were excluded from the statistical analyses (Table 1).
All assemblies produced contigs with skewed read coverage distributions, that is, those contigs that represented the "true alleles" had much higher average coverage than the remaining contigs ("noise"; see Figure 3). Noise was caused by PCR recombinants (Meyerhans, Vartanian, & Wain-Hobson, 1990; recognizable as chimera of the most common alleles) and sequencing errors (SNPs with <1% occurrence among all reads of the specific allele and individual).
Several individuals per species were sequenced two or more times (during variation assessment, multiplex testing, and the final sequencing run) at all loci. This allowed for the estimation of the error rate. The same number of alleles was retrieved for each locus and individual, though sequence variation occasionally occurred.

| Size homoplasy
The sequence data revealed that size homoplasy is very common (Table 2). It differed between species (mean over all loci 44.7%-63.5%) and-to a very high degree-between loci within one species, ranging from 20% to 100% with regard to the ratio of the number of fragment size classes with sequence variability/total number of size classes. Regarding the flanking regions, SNP variation was much more common than indel variation: SNP/indel ratios were 90/11, 97/7, and 71/7 for D. fascicularis, M. tridens, and O. obtusangulus, respectively. Many SSR loci also contained mutated SSR motifs (Table 3).
Size homoplasy was detected at all levels-among geographic populations, between different individuals of the same population, and even between alleles of the same individual (data not shown).
The degree of size homoplasy was not correlated with mean fragment length, number of repetitive units, number of variable sites, or number of SNPs in the flanking regions (Pearson's correlation coefficient <0.05, data not shown). To check for erroneous SNP calls, the number of rare alleles (occurring just once in the respective dataset) was recorded for each species and locus (Table 4). Based on the sequence-identity dataset, the percentage of rare alleles in relation to the total number of alleles differed between species. The percentage of rare alleles was highest in tetraploid M. tridens (38.4%), followed by D. fascicularis (21%), and lowest in O. obtusangulus (3.5%).   (Table 4). In case of M. tridens, two loci (mt28267 and mt34724) had no variation in length of neither the SSR nor the whole fragment, but had five and thirteen alleles, respectively, based on sequence identity. There was also one locus in D. fascicularis (df142807) that was monomorphic in the SSR-length dataset, but in case of fragment-length and sequence-identity datasets, allele numbers were two and eight, respectively.

| Output statistics and estimation of ploidy levels
Demultiplexing successfully recovered all loci of the three study species, although 12 out of 58 loci had low coverage (<10 reads per allele) and those were excluded from the analyses. The adjustment of relative primer concentrations within a given PCR multiplex group is rather approximate and cannot guarantee equal yield for each locus. The ploidy level of M. tridens has not been reported yet; however, the basic chromosome number of the genus Mulguraea is x = 10 and related species were reported to be di-, tetra-, and hexaploid (Botta & Brandham, 1993). Based on the number of retrieved alleles and graphs of the standardized number of reads per each allele of the respective locus and individual ( Figure 3), we inferred tetraploidy for M. tridens. Thus, the method can be successfully applied also for tri-or tetraploid species. The applicability in higher polyploids has yet to be tested, but based on our data of two other species included in the original study (Berberis microphylla, Chuquiraga aurea; data not presented here) suggest that allele detection in octo-and higher ploids might be complicated due to the presence of PCR recombinants (Brassac & Blattner, 2015), PCR duplicates, sequencing errors, and the problem of missing single-copy alleles.

| Size homoplasy
All SSR loci used in our study contained SNP variation in the FR and sometimes also in the repetitive region (Table 3), that is, the true allele number per population was always higher than when only length information would have been recorded (  Note. SSR motif: sequence of the SSR motif, number indicates how many times the motif is present. Variable sites include sequence variation in the SSR region itself and in the flanking region plus the variation in number of repetitive motifs (if the variation was present). N indicates numbers of variable sites, SNPs and indels, respectively. Locus names carrying the prefix df correspond to Donatia fascicularis, whereas mt and oo stand for Mulguraea tridens and Oreobolus obtusangulus.
a Includes two different SSR regions.

TA B L E 3 (Continued)
TA B L E 4 Total number of alleles and number of rare alleles (present only once in the respective dataset) for the three different datasets based on SSR-length, fragment-length (Fr length), and sequence-identity (Seq identity)  Unfortunately, their way of calculating homoplasy is not fully clear, so that a direct comparison with our data is not possible. Nonetheless, we expect that the amount of homoplasy will increase with an increasing number of genotyped individuals, especially if these are geographically and evolutionarily more distant to each other.
In the FR, SNPs were around 10 times more frequent than indels (Table 3). This ratio is much higher than the one reported by Mogg et al. (2002) for Zea mays, in which the mean ratio over all loci between SNPs and indels was 2:1. High number of SNPs could be caused by PCR/sequencing errors, although it is not very likely for several rea-  (Table 4). Nonetheless, negligible effects of sequencing errors cannot be ruled out.
In the case of D. fascicularis, the lengths of the indels of the FR were not congruent with the repeat motif length of the SSR, which was opposite to M. tridens and O. obtusangulus, where six and five out of six indels, respectively, could be confounded with tri-, tetra-, or hexanucleotide repeats if only fragment lengths were taken into account (Table 3).
The presented method allows using longer fragments (mean locus length was 329 bp over all loci and species with a maximum length of 418 bp), as compared to the traditional way of SSR scoring, increasing the likelihood that the FR contain genetic variation. This, in fact, does not prevent short fragments of already available primers and SSR loci to be successfully genotyped applying our method (Sochor, Šarhanová, Pfanzelt, & Trávníček, 2017). Nevertheless, the degree of size homoplasy did not correlate with mean fragment length, number of variable sites or number of SNPs or SSR units. Interestingly, the shortest locus of D. fascicularis (df91667) had the highest number of variable sites (Table 3). In the case of eventual correlations of size homoplasy and the number of variable sites, it should be considered that the way of calculating size homoplasy in the present study does not take into account how many alleles of the same fragment size class (differing in sequence, but not in length) are present. Therefore, the detection of a further SNP variant within a given fragment size class that already contains different sequences would not lead to a higher degree of size homoplasy. Although our SSR loci originated from RNA sequencing and we obtained blast hits for some of the loci, we could not test for a possible correlation between size homoplasy and the origin of sequences, that is, whether they stem from functional genes or the noncoding portion of the genome.

| Estimation of genetic diversity
The statistical analyses confirmed that in all three study species, the sequence-identity dataset conveyed more information than the SSRlength and fragment-length datasets (Tables 4 and 5  at all, the high resolution of the sequence-identity dataset is not informative either. Nonetheless, if additional information can be collected by sequencing of the studied loci, it is highly recommended to do so to ensure correct evolutionary interpretations. This mirrors Peakall et al.'s (1998: 1,283) earlier, but apparently often neglected, statement: "Consequently, DNA sequencing of SSR alleles will be essential to minimize the risk of misinterpretation and to maximize the genetic information that can be obtained." Now widely available NGS technologies make it possible to routinely score SSR alleles through sequencing.
On the other hand, genetic diversity parameters (H e and H o ) were rather similar for the SSR-length and fragment-length datasets, but were generally higher for the sequence-identity dataset in all three species (Table 5). This was especially remarkable in one locus of D. fascicularis (df142807) and two loci of M. tridens (mt28267 and mt34724), which appear to be monomorphic if only fragment or SSR lengths are considered. However, these loci were highly variable based on sequence identity (8, 13, and 5 alleles, respectively;

| Microsatellite mutation models
Permuted R ST values suggested for all but two studied loci (df91667, oo40886; Table 5) that stepwise mutations do not significantly contribute to genetic differentiation. Interestingly, the observed R ST value of locus df91667 was higher than the permuted R ST , indicating the fit to the SMM, but only for the SSR-length dataset. In fact, the four indels in the FR of that locus would mask this output in case of the traditional fragment length assessment. The second locus (oo40886) shows a bimodal distribution of the number of repeats (4-6 and 9-12) and thus fits rather to the two-phase model of microsatellite evolution (Di Rienzo et al., 1994), in which frequent singlestep, but also rare large changes in repeat number occur. The IAM does not fit to the evolution of most of the studied loci, as it does not allow for the existence of homoplasy (Estoup et al., 2002). Other models like proportional slippage/point mutation (Kruglyak, Durrett, Schug, & Aquadro, 1998), the K-allele model (Crow & Kimura, 1970), or more complex stepwise models can better reflect the evolution of microsatellites and should be considered in future research.

| SUMMARY
Our multiplex SSR sequencing strategy produced useful information about the actual nucleotide sequences of SSR amplicons and allowed for the detection and quantification of hidden variation in a large dataset of non-model plant species. It was shown that size homoplasy is a very common phenomenon and that indel polymorphism in the FR can be erroneously confounded with length variation within the SSR region. The additional information allows for a better understanding of microsatellite mutation processes. Sequencing of SSR loci is a prospective method with the ability to detect variability on both the intra-and inter-species level and thus can be suitable for both wide-and fine-scale phylogeographic studies based on single marker types.

ACK N OWLED G EM ENTS
The authors would like to thank the anonymous reviewers for their valuable comments to improve the quality of the manuscript. Funding from DFG (AL632/7-1, BL462/11), DAAD (D/10/49464) and Dr. Karl Wamsler Stiftung is acknowledged. CONAF (Chile), APN (Argentina), and the Falkland Islands Government issued collection permits and provided logistic help, as well as Centro EULA (Universidad de Concepción) and CEQUA (Universidad de Magallanes, Punta Arenas).

CO N FLI C T O F I NTE R E S T
The authors have no conflict of interests to declare.

AUTH O R CO NTR I B UTI O N S
The first two authors contributed equally to this work. PS, SP, and FRB carried out the fieldwork. PS and SP performed laboratory work and data analyses. PS, AH, RB, and FRB developed the barcode approach and devised the sequencing strategy. PS, SP, and FRB wrote the manuscript. All authors contributed to and approved the final manuscript version.

DATA ACCE SS I B I LIT Y
Allele sequences (accession numbers MG322761-MG323307) and RNA-Seq raw read data (SRA accession numbers SRX4496448-SRX4496451) were submitted to NCBI GenBank.