A pipeline for effectively developing highly polymorphic simple sequence repeats markers based on multi‐sample genomic data

Abstract Simple sequence repeats (SSRs) are widely used genetic markers in ecology, evolution, and conservation even in the genomics era, while a general limitation to their application is the difficulty of developing polymorphic SSR markers. Next‐generation sequencing (NGS) offers the opportunity for the rapid development of SSRs; however, previous studies developing SSRs using genomic data from only one individual need redundant experiments to test the polymorphisms of SSRs. In this study, we designed a pipeline for the rapid development of polymorphic SSR markers from multi‐sample genomic data. We used bioinformatic software to genotype multiple individuals using resequencing data, detected highly polymorphic SSRs prior to experimental validation, significantly improved the efficiency and reduced the experimental effort. The pipeline was successfully applied to a globally threatened species, the brown eared‐pheasant (Crossoptilon mantchuricum), which showed very low genomic diversity. The 20 newly developed SSR markers were highly polymorphic, the average number of alleles was much higher than the genomic average. We also evaluated the effect of the number of individuals and sequencing depth on the SSR mining results, and we found that 10 individuals and ~10X sequencing data were enough to obtain a sufficient number of polymorphic SSRs, even for species with low genetic diversity. Furthermore, the genome assembly of NGS data from the optimal number of individuals and sequencing depth can be used as an alternative reference genome if a high‐quality genome is not available. Our pipeline provided a paradigm for the application of NGS technology to mining and developing molecular markers for ecological and evolutionary studies.

genomic data. We used bioinformatic software to genotype multiple individuals using resequencing data, detected highly polymorphic SSRs prior to experimental validation, significantly improved the efficiency and reduced the experimental effort. The pipeline was successfully applied to a globally threatened species, the brown earedpheasant (Crossoptilon mantchuricum), which showed very low genomic diversity. The 20 newly developed SSR markers were highly polymorphic, the average number of alleles was much higher than the genomic average. We also evaluated the effect of the number of individuals and sequencing depth on the SSR mining results, and we found that 10 individuals and ~10X sequencing data were enough to obtain a sufficient number of polymorphic SSRs, even for species with low genetic diversity. Furthermore, the genome assembly of NGS data from the optimal number of individuals and sequencing depth can be used as an alternative reference genome if a high-quality genome is not available. Our pipeline provided a paradigm for the application of NGS technology to mining and developing molecular markers for ecological and evolutionary studies.

K E Y W O R D S
microsatellite, molecular marker, next-generation sequencing, resequencing, short tandem repeats, threatened species

| INTRODUC TI ON
Simple sequence repeats (SSRs), or microsatellites, are highly variable genetic markers useful for a wide variety of applications in genetic analysis, including genetic mapping, population structure and gene flow analysis, identification of conservation units, and kinship analysis (Gerber et al., 2000;Vashistha et al., 2020;Zamudio & Wieczorek, 2000). Since the first application of SSRs in the 1990s, they have been extensively and continuously used in evolutionary, ecological, and conservation research even in the genomics era (Ali et al., 2019;Allendorf, 2017;Shahabzadeh et al., 2020).
Despite the many advantages of SSR markers, a general limitation to their application is the difficulty of developing polymorphic SSR markers (Squirrell et al., 2003). The development of new SSR markers can basically be divided into the following stages: (1) identification of the sequences containing SSRs; (2) design of PCR primers from flanking regions; and (3) detection of polymorphisms among individuals (Andrés & Bogdanowicz, 2011;Vieira et al., 2016). Traditional approaches, consisting of cloning, cDNA library construction, and Sanger sequencing, are time-consuming, labor-intensive, and inefficient in the SSR identification and primer design stages (Zane et al., 2002). Next-generation sequencing (NGS) can largely overcome these shortcomings, providing effective ways to mine tens of thousands of SSR sequences with sufficient flanking regions. Therefore, NGS technologies are increasingly been applied to obtain novel SSR markers in non-model organisms (Abdelkrim et al., 2009;Gardner et al., 2011;Wang et al., 2017).
However, previous studies usually used genomic data from only one individual to mine SSR sequences and then randomly selected a few hundred SSRs for polymorphism detection in several individuals through PCR experiments (McCulloch & Stevens, 2011;Zhou et al., 2016). As you can see, it needs to design a lot of primers to manually test their polymorphisms, which is still time consuming and inefficient. The rate of obtaining polymorphic SSR markers is still not high (Taheri et al., 2018), especially for threatened species in which among-individual genetic differences are subtle. Only a small percentage of randomly selected loci were highly polymorphic and easy to amplify (e.g., Hou et al., 2018;Yang et al., 2017). Thus, the limiting step for SSR development through NGS technologies is no longer SSR identification or primer design, but instead, detection and screening of polymorphic loci. One strategy to break this limitation is to track SSR polymorphisms before PCR experiments, which can be done by sequencing multiple individuals. One idea is to develop primer sequences for every SSR loci from each individual, the primer sequences were then used to identify intersectional SSR loci, these SSR loci were extracted to evaluate their polymorphism (e.g., Cui et al., 2018;Fox et al., 2019). With the development of NGS software, a more straightforward way is to align sequence reads from all individuals to a reference genome to identify polymorphic SSR loci before developing primers (e.g., Guo et al., 2020). This is still a rather new point of view, the effects of using different number of individuals, different sequence depth, and the quality of the reference genome on the yield of polymorphic SSR markers has rarely been explored, especially for threatened species.
The aim of this study is to develop an effective pipeline for the rapid development of highly polymorphic SSR markers from multisample genomic data. We used a subset of the population genomic data from a global threatened galliform bird, the brown earedpheasant (Crossoptilon mantchuricum), which has very low genomic diversity (Wang et al., 2020). We showed that our pipeline can effectively discover polymorphic SSR markers from such a species and successfully estimate its population structure with the developed SSRs, which showed the same result from genomic data (Wang et al., 2020). Furthermore, we evaluated the effect of different numbers of individuals and sequencing depth on the SSR mining results and assembled a reference genome using multi-sample low-depth data instead of single-sample high-depth data, which could be a reasonable strategy balancing the sampling and sequencing costs, especially for species without a reference genome.

| Data and sample collection
For the in silico mining of polymorphic SSRs, we used the assembled genome of one individual and resequencing data (~20 X) of 20 individuals of C. mantchuricum downloaded from the National Genomics Data Center (BioProject number: PRJCA003284, https://bigd.big. ac.cn/?lang=en), the sample information and accession number of each sample can be seen in Table S1. All the 20 individuals used in this study are not closely related (Wang et al., 2020).
For the experimental validation of the SSR markers, a total of 30 wild individuals of C. mantchuricum were sampled from Hebei (n = 6), Beijing (n = 2), Shanxi (n = 15), and Shaanxi (n = 7) in China. The tissue or blood samples were preserved at −80°C for long-term storage.
Genomic DNA was extracted using a DNA extraction kit (TianGen Biotech, Beijing, China) following the manufacturer's instructions.
After tandem repeats identification, we obtained a BED file with our custom set of tandem repeats. Then, we used lobstr_index.py in lob-STR v 4.0.6 (Gymrek et al., 2012) and the BED file to build a custom lobSTR reference for C. mantchuricum (http://lobstr.teame rlich.org/ best-pract ices-custo m-refer ence.html). Meanwhile, the raw reads of the 20 individuals of C. mantchuricum were filtered with Trim Galore v 0.5.0 (Krueger, 2012) with default parameters, and clean reads were mapped to the C. mantchuricum reference genome with BWA-MEM v 0.7.17-r1188 ) to generate BAM files. Then, we used these BAM files and the custom reference as input for lobSTR (Gymrek et al., 2012) to run allelotypes. After allelotyping, we used a custom Bash script (Appendix S1) to select polymorphic SSR loci from the VCF file generated by lobSTR.
We used a very strict criterion to select SSRs for subsequent experimental validation. First, we focused only on the "perfect" SSRs (uninterrupted run of repeats) (Sharma et al., 2007) that can be successfully genotyped across all individuals (NS = 20) to avoid PCR failure and null alleles. Second, we restricted the motif length to 3-5 bp to avoid genotyping error. Third, we selected SSRs with high polymorphism, that is, the number of alleles for each locus ≥5 (see results), among which 34 potential polymorphic loci comprising different motif lengths were selected for downstream analyses.

| SSR primer design and experimental validation
First, we used BEDTools v 2.26.0 (Quinlan & Hall, 2010) (Table S2) for 30 s, 72°C for 30 s; and a final step at 72°C for 5 min. The PCR products were detected by 2% agarose gel electrophoresis. As a result, we obtained 30 loci that were reliably amplified.
We randomly selected 20 of 30 SSR candidates to synthesize fluorescently labeled forward primers (5′-FAM, HEX, ROX·; Beijing Genomics Institute, Beijing, China) and performed PCR amplification of all 30 individuals of C. mantchuricum as described above. The PCR products were sent to Qingke Biotech (Beijing, China) for SSR genotyping detection. Allele scoring for each marker was performed with Genemarker v 2.2.0 (Holland & Parson, 2011).

| Statistical and population structure analyses
Genetic parameters such as the number of alleles (N a ), polymorphism information content (PIC), observed heterozygosity (H o ), and expected heterozygosity (H e ) were calculated by Cervus v3.0 (Marshall et al., 1998). The frequency of null alleles was estimated using FreeNA (Chapuis & Estoup, 2007). Linkage equilibrium (LD) were tested using Genepop (Rousset, 2008) with the following parameters: dememorization = 10,000, batches = 20, iterations per batch = 5000. The Bonferroni correction for p value was done by Myriads v1.2 (Carvajal-Rodríguez, 2017). F I G U R E 1 Workflow for in silico microsatellite mining, polymorphism discovery, and primer design using a series of commonly used software programs (shown in italics). The pipeline takes multi-sample resequencing data in FASTQ format and reference genome in FASTA format as input data (the reference genome can be generated with assembly software such as MaSuRCA from multi-sample resequencing data for species whose reference genomes were unavailable) Population structure analyses were performed using principal coordinate analysis (PCoA) in GenAlEx v6.5 (Peakall & Smouse, 2006) and the model-based software program STRUCTURE v2.3.4 (Pritchard et al., 2000). The number of subpopulations (K) was set to range from 1 to 10, and for each K, 10 replications were tested. For each run, a burn-in period was set to 100,000 with 100,000 MCMC iterations. The log probability of the data (LnP(D)) was calculated to confirm the convergence. To determine the most likely value of K, the Evanno method (Evanno et al., 2005) was used via the online program STRUCTURE HARVESTER (http:// taylo r0.biolo gy.ucla.edu/struc tureH arves ter/) (Earl & Vonholdt, 2012). Genetic differentiation among the populations was calculated with the Weir and Cockerham (1984) estimator of the fixation index (F st ) using FSTAT v2.9.4 and 1,000 permutations were used to test for significant differences (Goudet, 1995;Weir & Cockerham, 1984).

| Effects of the number of individuals and sequencing depth on SSR mining
To test the effect of the number of individuals on SSR mining, we randomly selected 2, 4, 6, 8, 10, 12, 14, 16, and 18 individuals from the 20 individuals (sequencing depth ~20X) to perform the same analyses as described above (Figure 1). Second, we fixed the number of individuals as 10 (the optimal number of individuals based on our results) to explore the effect of sequencing depth. The average sequencing depth of each sample was calculated by the tool "bamdst" (https://github.com/shiqu an/bamdst). Then, we used SAMtools v1.9  to randomly generate 2.5X, 5X, 7.5X, 10X, 12.5X, 15X, 17.5X, and 20X resequencing data for each of the 10 individuals and performed the same analyses ( Figure 1). For each analysis, we focused only on the SSR loci that existed in all selected individuals and exhibited at least two alleles. To evaluate the SSR mining results, we calculated two parameters: the number of polymorphic SSRs and the Na for each SSR locus. Then, we used R v4.0.2 (R Core Team, 2020) to draw line charts and violin plots to visualize the increasing trend of these two parameters to estimate the optimal values for the number of individuals and sequencing depth. The magnitude of change of the average Na between different individuals and sequencing depth was assessed using Cohen's d effect size analysis.
A value of 0.20 is considered a small effect, 0.50 is considered a medium effect (Cohen, 1992).

| SSR mining using multi-sample lowdepth resequencing data without a prior/known reference genome
Generally, a high-quality reference genome is necessary to map resequencing data and to develop SSR markers (Hou et al., 2018).
However, the assembly of an eligible reference genome usually requires deep sequencing >100X from the same individual, which results in considerable additional cost (Desai et al., 2013). To fully utilize the multi-sample resequencing data and reduce the sequencing cost, we derived the idea used in pan-genome studies and tried to use multi-sample low-depth data to assemble a "consensus" reference genome of C. mantchuricum. 10X resequencing data of each ten individual (100 X data in total, the optimal number of individuals and sequencing depth based on our results) were used for de novo assembly of the C. mantchuricum genome with MaSuRCA assembler v3.4.2 (Zimin et al., 2013). MaSuRCA is an overlap-layout-consensus (OLC) algorithm-based assembler that tolerates differences such as SNPs, heterozygotes, and sequencing errors to generate consensus.
This feature enables it to generate a consensus assembly by integrating multi-sample low-depth data as conventional deep sequencing data. We used the assembled genome as a reference and carried out SSR mining using resequencing data following the designed pipeline ( Figure 1). To test the validity of the assembled "consensus" genome on SSR mining results, we calculated the number of polymorphic SSRs and the Na for each SSR locus and compared these SSRs developed with the "consensus" genome to those SSRs developed with the high-quality genome. In addition, we also compared the ±350 bp flanking sequences on both ends of all the polymorphic SSRs with BLASTN v2.5.0+ (Altschul et al., 1990) to identify the intersection of polymorphic SSRs extracted from the two reference genomes.
The SSR loci with >95% identity and >500 bp alignment length were considered the same loci.
The average number of alleles (N a ) for the 12,549 SSRs was only 1.36, and 81.01% of the loci were monomorphic (Figure 2). The proportion of polymorphic loci for dinucleotide repeats was higher than that for other types (Figure 2). However, dinucleotide microsatellites are easily subjected to mistyping due to polymerase slippage during PCR (Schlötterer & Tautz, 1992). To develop SSR markers with strong stability and a low genotyping error rate, we only focused on SSRs with motif lengths ranging from 3 to 5 bp in this exploratory study.
We obtained 952, 228, 83, 34, and 10 loci when we restricted the minimum Na to 2, 3, 4, 5, and 6, respectively (Table S3). Based on the above results, we selected the 34 SSRs with an N a ≥ 5 to perform downstream analyses.

| Descriptive statistical and population structure
The PIC values of the 20 SSR loci ranged from 0.315 to 0.707 with an average of 0.532. The N a ranged from 2 to 6 with an average of 4.2, the H o ranged from 0 to 0.633 with an average of 0.345, and the He ranged from 0.398 to 0.760 with an average of 0.597 (Table 1).
PCoA divided the 30 brown eared-pheasants into three genetic clusters (Figure 3). The first two principal coordinates (PCos) explained 60.23% of the total variance (45.22% and 15.01%, respectively). In our 10 independent structure analyses to estimate K, the values of LnP(D) increased sharply from K = 1 to K = 3, and delta K analysis revealed a peak at K = 2 ( Figure S1), suggesting at least two clusters. When K = 3, the three populations of brown earedpheasant can also be separated very well (Figure 3), which is in accordance with the PCoA results. We identified the three brown  (Table S4).
The population structure analyses showed high genetic differentiation among the three populations of the brown eared-pheasant, so we estimated null allele frequency of the 20 loci in three population separately. The prevalence of null alleles for most loci is low (<0.05), except for loci CM27 and CM12 in the central and eastern populations, respectively (>0.2, Table S5). The average null allele frequency of the 20 loci is low (<0.05) among three populations. Such a low frequency of null alleles only has slight impact on population genetic analyses (Carlsson, 2008;Chapuis & Estoup, 2007;Dakin & Avise, 2004). Only 2 out of 570 tests for LD were significant after Bonferroni correction (CM8-CM14 and CM11-CM25 in the central population). These four SSR loci located on different scaffolds, the observed LD might be caused by the low genomic diversity (Wang et al., 2020) and/or small sample size rather than true linkage.

| Effect of the number of individuals and sequencing depth on mining SSRs
The number of individuals had a great influence on the calculated N a , while the sequencing depth had a great influence on the obtained number of polymorphic SSRs (Figure 4). When the number of individuals reached 10, the increasing trend of Na slows down (Cohen's d: Na 2 vs. Na 10 : −0.61 (medium), Na 10 vs. Na 20 : −0.12 (small)), and the number of polymorphic SSRs reached approximately 2,037 (85.48% compared to using 20 individuals; Figure 4a). Thus, we fixed the number of individuals to ten to explore the effect of sequencing depth.
Our results showed that Na remained nearly stable with increasing depth (2.460-2.695; Cohen's d: Na 2.5x vs. Na 20x : −0.23 (small)), while the number of polymorphic SSR loci increased rapidly (from 50 to 2022, Figure 4b, Table S6). The increase in the number of SSR loci slowed when the sequencing depth reached 10-12.5 X (Figure 4b), when sufficient (1539-1825) polymorphic SSR loci were identified.

| SSR mining results using multi-sample lowdepth resequencing data without a prior/known reference genome
The draft "consensus" genome assembly based on resequencing data and 100.00 kb, respectively. Using this assembled genome as the reference genome, we found a total of 9306 "perfect" SSR loci, of which 1590 (17.09%) were polymorphic. In comparison, the number of polymorphic SSR loci extracted from the canonical reference genome was 1,539, which was similar to the assembled genome (1590 vs. 1539, Figure S2). The BLASTN analysis showed that nearly 80% (1216/1539) of these loci overlapped. The average Na of all the SSR loci from the assembled genome was 1.275, which showed no significant difference compared to that of the SSR loci obtained using the canonical reference genome (average N a = 1.271; t-test: T = 0.37, df = 18460, p = .71).

| DISCUSS ION
Although genome-wide SNPs have become more and more popular for studies of population genetics, SSRs are still valuable genetic markers due to their high polymorphism, low DNA template demands, relatively easy application, along with well-developed and simple statistical analyses (Hodel et al., 2016;Zane et al., 2002). There are several scenarios where SSRs are comparable with genome-wide SNPs. For example, studies require parentage and kinship determination in behavioral ecology and genetic management do not require high marker density, but benefit more from large number of samples (de Deus et al., 2021). It is impractical and expensive to genotype thousands of individuals using genome-wide SNPs and it is hard to update the dataset if small numbers of new individuals are added. Conversely, once the SSR markers has been developed, it would be much easier and more economical to genotype additional individuals (Puckett, 2017). In addition, SSR is still the most widely used genetic marker in forensic identifications and noninvasive genetic studies of endangered species from degraded samples owing to its low quantity/quality DNA template demand and high reproducibility results (Lampa et al., 2013;Willows-Munro & Kleinhans, 2020). We can acquire sufficient DNA for SSR genotyping even in degraded samples such as eggshells, feathers, and feces (Baus et al., 2019). Furthermore, a strong background in computing skills and bioinformatics is needed to deal with the large quantity of SNPs, whereas researchers can complete SSR analyses with limited computing skills on a laptop computer (Hodel et al., 2016). For all these reasons, microsatellites remain a good choice for many systems and questions and they will continue to be used extensively in ecology, evolution, and conservation in the future.
In this study, we developed a pipeline to mine polymorphic SSR markers based on NGS data from multiple individuals of the target species. The pipeline was successfully applied to a globally threatened species with very low genomic diversity (Wang et al., 2020).
We further evaluated the effect of different numbers of individuals and sequence depths on the SSR mining results to suggest a reasonable strategy balancing data generation and cost. Additionally, we showed that the pipeline worked well even without a high-quality reference genome, which further extended its application range and decreased the cost of developing applicable polymorphic SSR markers.
We found that the average Na was only 1.36 for the brown eared-pheasant at the genome scale, and less than 10% of SSRs had more than two alleles among 20 individuals (Figure 2, Table   S3). Therefore, it will be rather inefficient to filter polymorphic SSR markers through experimental validation from randomly chosen SSR loci, which is the commonly used SSR marker development method using NGS data (Table 2) (Hou et al., 2018;Huang et al., 2015;Taheri et al., 2018). For example, Zhu (2014)  randomly selected SSR markers are polymorphic, the Na ranged from 2 to 6, with an average of 4.2 among 30 individuals, which was significantly higher than the average Na on the genome scale.
Our results showed that the increase trend of Na slows down after subsampling more than 10 individuals (Figure 4a).  (Koshiishi et al., 2021;Yang et al., 2017). As for the multisample strategy, we used multi-sample low-depth data to generate a draft reference genome inspired by pan-genome strategies, The scaffold and contig N50 of the assembled genome were approximately 134 kb and 100 kb, respectively, which are lower than those of the canonical high-quality genome (scaffold/contig N50: 3,632.75 kb/112.76 kb;Wang et al., 2020). Although the quality of the assembled genome was lower than that of the canonical highquality genome, the numbers of polymorphic SSR loci mined with our pipeline were very similar, and approximately 80% of SSR loci overlapped, which might be higher if we lower the length standard of the flanking sequence. Furthermore, the average Na between SSR markers from the assembled genome and SSR markers from the canonical high-quality reference genome showed no significant difference, which means that the distribution of SSRs in the "consensus" genome was highly consistent with the high-quality reference genome. Overall, the use of a reference genome by using a "consensus" genome strategy from multi-sample low-depth data can yield approximately the same number of polymorphic SSR loci, which can further reduce the cost of developing SSR markers. As the sequencing cost of NGS has dramatically declined since its invention (https://www.genome.gov/about -genom ics/fact-sheet s/ DNA-Seque ncing -Costs -Data), the cost for resequencing 10X data from 10 individual of birds is about $1100. If we randomly design 100 primers to detect polymorphism on 4 individuals (many studies used this strategy), the cost is about $550 for primer synthesis, and $1000 for sanger sequencing, which are similar or even slightly higher than the resequencing cost.
The brown eared-pheasant is a globally threatened species distributed in China (Zheng, 2015), and polymorphic SSR markers for this species are still unavailable. In this study, we developed 20 new SSR markers. The PIC indicated that 12 markers were highly informative (PIC > 0.5), and the other eight were reasonably informative (0.5 < PIC < 0.25) (Botstein et al., 1980). These SSR loci were successfully applied to the population structure analysis for the brown eared-pheasant. The PCoA and structure analysis revealed three populations across the range of the brown eared-pheasant ( Figure 3), in accordance with the results from genomic SNP data (Wang et al., 2020). However, the structure analysis revealed a peak of delta K at K = 2, while it separated the three populations very well when K = 3 ( Figure 3). Previous studies found that there was a strong bias toward selecting K = 2 using the delta K method (Cullingham et al., 2020). In addition, uneven sample sizes between subpopulations may lead to the underestimation of delta K (Puechmaille, 2016 Higher polymorphic dinucleotide SSRs can be easily obtained from our pipeline for further research.

| CON CLUS ION
In this study, we developed a pipeline for the rapid development of polymorphic SSR markers using multi-sample genomic data. Our pipeline can be easily applied in non-model species in which genomic information is unknown and in threatened species in which genetic diversity is extremely low. Our pipeline provided a paradigm for the application of NGS technology in mining molecular markers for ecological and evolutionary studies.

ACK N OWLED G M ENTS
This work was supported by the National Natural Science editors and two anonymous reviewers whose feedback helped to improve the manuscript substantially.

CO N FLI C T O F I NTE R E S T
None declared.