Phylogenetic significance of the characteristics of simple sequence repeats at the genus level based on the complete chloroplast genome sequences of Cyatheaceae

Abstract The simple sequence repeats (SSRs) of plant chloroplasts show considerable genetic variation and have been widely used in species identification and phylogenetic relationship determination. Whether chloroplast genome SSRs can be used to classify Cyatheaceae species has not yet been studied. Therefore, the chloroplast genomes of eight Cyatheaceae species were sequenced, and their SSR characteristics were compared and statistically analyzed. The results showed that the chloroplast genome structure was highly conserved (genome size: 154,046–166,151 bp), and the gene content (117 genes) and gene order were highly consistent. The distribution characteristics of SSRs (number, relative abundance, relative density, GC content) showed taxon specificity. The primary results were the total numbers of SSRs and mononucleotides: Gymnosphaera (61–67 and 40–47, respectively), Alsophila (121–122 and 95–96), and Sphaeropteris (102–103 and 77–80). Statistical and clustering analyses of SSR characteristics showed that their distribution was consistent with the recent classification of Cyatheaceae, which divided the eight Cyatheaceae species into three genera. This study indicates that the distribution characteristics of Cyatheaceae chloroplast SSRs can provide useful phylogenetic information at the genus level.


| INTRODUCTION
Simple sequence repeats (SSRs), also known as microsatellites, are short tandem repeat sequences with a motif length of 1-6 bp. They are characterized by high variability and codominant inheritance and have been widely used in species identification, genetic diversity studies, and phylogenetic relationship determination (Chmielewski et al., 2015; Dashnow et al., 2015). SSRs are caused by slipped strand mispairing and subsequent errors during DNA replication, repair, and recombination (Levinson & Gutman, 1987). SSRs are mainly found in intergenic and noncoding regions, with a few present in introns (Li et al., 2004; Liu et al., 2021; Su et al., 2018). Previous studies have shown that the characteristics of genomic SSRs in different taxa (such as their distribution patterns) reflect their phylogenetic relationships (Manee et al., 2020; Srivastava et al., 2019).
The distribution of SSRs in some chloroplast (cp) genomes is nonrandom and dominated by mononucleotides, where A/T bases account for the majority (Ellegren, 2004; George et al., 2015; Ren et al., 2021). Some Polypodiaceae species show similar SSR distribution patterns in their cp genomes. Recently, Ping et al. (2021) analyzed the distribution pattern of Cupressus SSRs and found that the distribution patterns in Cupressus and Hesperocyparis are highly consistent. In addition, according to the proportions of A/T bases and mononucleotides in Callitropsis funebris, this species is closer to Cupressus. Studies have shown that the number and types of SSRs in cp genomes are conserved within genera, and the types of SSRs differ extensively among genera in Dryopteridaceae (Fan et al., 2021). Furthermore, cpSSRs continue to provide important new clues to explore the phylogeny among lineages.
The Cyatheaceae are an impressive group of ferns because of their arborescent features and rich number of species. They account for the vast majority of known tree ferns and are mainly distributed in warm and humid tropical and subtropical regions (Kramer, 1990; PPG I, 2016; Smith et al., 2006).
In this study, the cp genomes of eight Cyatheaceae species were sequenced, and the distribution and characteristics of their SSRs were compared. The cp genome sequences of these eight species represent the existing cp genome data for Cyatheaceae, covering most Cyatheaceae genera. Our major objectives were to (a) report the complete cp genomes of G. denticulata and G. metteniana; (b) compare the distribution patterns of the cpSSRs of the eight Cyatheaceae species; and (c) reveal the phylogenetic significance of the SSR characteristics. Our findings may serve as a foundation for studying the evolutionary cp genomics and phylogeny of Cyatheaceae.

| Sampling
The leaves of Gymnosphaera denticulata Baker and Gymnosphaera metteniana Hance were collected from Nankunshan in Huizhou and the botanical garden of South China Agricultural University in Guangzhou, respectively. The specimens of Gymnosphaera denticulata Baker and Gymnosphaera metteniana Hance are stored in the Herbarium of South China Agricultural University (SCAUB; voucher: M Zhu 201910 and M Zhu 201908). The leaves of Gymnosphaera podophylla Hook and Gymnosphaera gigantea Wall. ex Hook were collected from the South China Botanical Garden of the Chinese Academy of Sciences in Guangzhou (Liu et al., 2018; Wang et al., 2019b). The leaves of Alsophila costularis Baker, Sphaeropteris brunoniana (Hook.) R. M. Tryon, and Sphaeropteris lepifera (Hook.) R. M. Tryon were collected from the Fairy Lake Botanical Garden of the Chinese Academy of Sciences in Shenzhen (Liu et al., 2020; Wang et al., 2019a; Zhu et al., 2020). The leaves of Alsophila spinulosa (Wall. ex Hook.) R. M. Tryon were collected from the Wuhan Botanical Garden of the Chinese Academy of Sciences in Wuhan (Gao et al., 2009). Fresh young leaves from well-grown plants were collected, wrapped in tin foil, flash-frozen in liquid nitrogen, and then stored at −80°C before use.

| DNA extraction and sequencing
A plant genomic DNA extraction kit (TIANGEN) was used to extract total DNA from the samples. After the quality of the total DNA samples was confirmed by Shanghai Hanyu Biotechnology Co., Ltd., the samples were subjected to bidirectional sequencing using an Illumina HiSeq 2500, and the raw data obtained were converted into raw reads by CASAVA base-calling analysis. The clean data obtained after removing the adaptor-containing, low-quality sequences were taken for subsequent analysis. Data processing was performed by Trimmomatic v0.32 (Bolger et al., 2014) with the following steps: (a) removal of sequences containing N bases; (b) removal of adaptor sequences in the reads; (c) removal of low-quality bases (Q value < 20) from the reads in the 3′ to 5′ direction; (d) removal of low-quality bases (Q value < 20) from the reads in the 5′ to 3′ direction; (e) removal of four bases with an average base quality <20; and (f) removal of the reads and their pairs with a length <50 nt. Velvet v1.2.03 (Zerbino & Birney, 2008) was used to assemble the clean data.
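The read-filtering steps listed above can be illustrated with a minimal Python sketch. This is not Trimmomatic's implementation: it covers only steps (a), (c), (d), and (f) for a single read, assumes Phred+33 quality encoding, and omits adaptor removal, the four-base window step, and paired-read handling. The function names `phred33` and `trim_read` are hypothetical and introduced here only for illustration.

```python
def phred33(qual):
    """Decode a Phred+33 quality string into integer scores (encoding is an assumption)."""
    return [ord(c) - 33 for c in qual]


def trim_read(seq, qual, min_q=20, min_len=50):
    """Return the trimmed (seq, qual) pair, or None if the read is discarded."""
    # (a) discard reads that contain N bases
    if "N" in seq.upper():
        return None
    scores = phred33(qual)
    start, end = 0, len(seq)
    # (c) trim low-quality bases from the 3' end toward the 5' end
    while end > start and scores[end - 1] < min_q:
        end -= 1
    # (d) trim low-quality bases from the 5' end toward the 3' end
    while start < end and scores[start] < min_q:
        start += 1
    # (f) discard reads shorter than the minimum length after trimming
    if end - start < min_len:
        return None
    return seq[start:end], qual[start:end]


# Hypothetical read: 3 low-quality bases, 55 high-quality bases, 2 low-quality bases
example = trim_read("A" * 60, "#" * 3 + "I" * 55 + "#" * 2)
```

In this sketch the thresholds `min_q=20` and `min_len=50` mirror the Q value and read length stated in the text; everything else is a simplification.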

| Characterization of chloroplast genomes
The cp genome of Alsophila spinulosa was used as the reference genome, and Dual Organellar GenoMe Annotator (DOGMA) (Milne et al., 2010) was used to predict the protein-coding genes, rRNA genes, and tRNA genes in other genomes. Geneious Prime (Kearse et al., 2012) was used for manual correction according to the reference genome. The Shuffle-Lagan mode in the online software mVISTA (Frazer et al., 2004) was used for genome-wide comparison.
Organellar Genome DRAW (OGDRAW) (Lohse et al., 2007) was used to draw physical cp genome maps, and Sequin software was used for submission of the cp genome of G. denticulata. Microsatellite repeats were predicted using the software MISA (Beier et al., 2017).
The threshold repeat number was set to 10 for mononucleotide units, six for dinucleotide units, five for trinucleotide units, and three for tetra-, penta-, and hexanucleotide units. The minimum distance between two SSRs was set to 0 bp; that is, compound SSRs were not counted. The distribution characteristics of SSRs of different species in the whole genome and its different regions were compared and analyzed. Among these characteristics, the relative abundance refers to the number of SSRs per unit sequence length (kb), and the relative density refers to the total length of the SSRs (bp) per unit sequence length (kb).
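As an illustration of these settings, the following Python sketch scans a sequence for perfect SSRs using the stated repeat thresholds and then computes relative abundance and relative density as defined above. It is a simplified stand-in written for this explanation, not the MISA program itself; the function names are hypothetical, and details such as how overlapping candidates are resolved are assumptions.

```python
# Minimum repeat counts per motif length, as stated in the text
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 3, 5: 3, 6: 3}


def is_reducible(motif):
    """True if the motif is itself a repeat of a shorter unit (e.g. 'AA' or 'ATAT')."""
    n = len(motif)
    return any(n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n))


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, end, motif, repeat_count) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for size, min_rep in min_repeats.items():
        i = 0
        while i + size * min_rep <= len(seq):
            motif = seq[i:i + size]
            if set(motif) - set("ACGT") or is_reducible(motif):
                i += 1
                continue
            j = i + size
            while seq[j:j + size] == motif:  # extend while the motif repeats
                j += size
            reps = (j - i) // size
            if reps >= min_rep:
                hits.append((i, j, motif, reps))
                i = j  # continue after the SSR just recorded
            else:
                i += 1
    return sorted(hits)


def relative_abundance(hits, seq_len_bp):
    """Number of SSRs per kb of sequence."""
    return len(hits) / (seq_len_bp / 1000)


def relative_density(hits, seq_len_bp):
    """Total SSR length (bp) per kb of sequence."""
    return sum(end - start for start, end, _, _ in hits) / (seq_len_bp / 1000)


# Hypothetical 45-bp test sequence containing (A)12, (AT)6, and (AGCT)3
example_seq = "GC" + "A" * 12 + "CGG" + "AT" * 6 + "GGC" + "AGCT" * 3 + "C"
example_hits = find_ssrs(example_seq)
```

Motifs that reduce to a shorter unit are skipped so that, for example, a run of A bases is counted once as a mononucleotide SSR rather than again as an (AA)n dinucleotide SSR.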

| Phylogenetic analysis
The maximum likelihood (ML), Bayesian inference (BI), maximum parsimony (MP), and neighbor-joining (NJ) methods were used for phylogenetic analysis. MAFFT software (Katoh & Standley, 2013) was used to align the complete cp genome sequences of eight species of Cyatheaceae and one species of Cibotium, Cibotium barometz (Linn.) J. Sm. A phylogenetic tree was constructed using C. barometz (Linn.) J. Sm. as an outgroup. Before tree construction, the whole cp genome alignment was screened in MrModeltest software to select the optimal nucleotide substitution model (GTR + I + G) based on the Akaike information criterion, and the relevant parameters were estimated. The ML tree was constructed with RAxML v8.0.20 (Stamatakis, 2014) using GTRGAMMAI as the nucleotide substitution model, and branch confidence was assessed by bootstrap analysis with the autoMR option. The BI tree was constructed with MrBayes v3.2.0 (Ronquist et al., 2012) by running 2,000,000 generations (Nst = 6, rates = invgamma). The MP tree was constructed in PAUP 4.0 (Swofford, 2002) with 1,000 bootstrap replicates. The NJ tree was constructed in MEGA 7.0 (Kumar et al., 2016) using the maximum composite likelihood algorithm with 1,000 bootstrap replicates. The resulting phylogenetic trees were viewed and edited in FigTree v1.4.3.

| Statistical analysis
When Gymnosphaera is considered an independent taxonomic unit at the genus level, the eight Cyatheaceae species are divided into three genera; that is, G. denticulata, G. podophylla, G. metteniana, and G. gigantea belong to the genus Gymnosphaera; A. spinulosa and A. costularis belong to the genus Alsophila; and S. brunoniana and S. lepifera belong to the genus Sphaeropteris. When Gymnosphaera is classified into the genus Alsophila, Cyatheaceae is divided into two genera. The Kruskal-Wallis H test and Mann-Whitney U test in IBM SPSS v22.0 software (Allen et al., 2014) were used to analyze the significance of differences between taxa when three genera and two genera were assumed, respectively. The statistical results covered the whole cp genome, the SSRs of different unit lengths in the cp genome, and the number, relative abundance, relative density, and GC content of SSRs and SSRs of different unit lengths in the intergenic spacer (IGS), large single-copy (LSC), intronic, and coding sequence (CDS) regions of the cp genomes of the eight Cyatheaceae species.
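To make the three-genus comparison concrete, the sketch below computes the Kruskal-Wallis H statistic in plain Python. It is an illustration only, not the SPSS procedure used in the study. The input values are placeholders chosen within the ranges of total SSR numbers reported for each genus and are not the per-species data; the p-value uses the chi-square approximation with two degrees of freedom, which is rough for samples this small.

```python
import math


def average_ranks(values):
    """Assign 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with tie correction (assumes values are not all equal)."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, pos = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[pos:pos + len(g)])
        total += rank_sum ** 2 / len(g)
        pos += len(g)
    h = 12 / (n * (n + 1)) * total - 3 * (n + 1)
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    ties = sum(t ** 3 - t for t in counts.values())
    return h / (1 - ties / (n ** 3 - n))


# Placeholder SSR counts within the reported ranges (NOT the actual per-species values)
gymnosphaera = [61, 63, 65, 67]
sphaeropteris = [102, 103]
alsophila = [121, 122]

h = kruskal_wallis_h([gymnosphaera, sphaeropteris, alsophila])
# With three groups (2 degrees of freedom), the chi-square tail probability is exp(-H/2)
p = math.exp(-h / 2)
```

For the two-genus hypothesis, the same ranking step would feed a Mann-Whitney U test instead; that case is not shown here.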
Hierarchical cluster analysis with p-values (PV clustering) using the Ward linkage method in R v3.5.1 (R Core Team, 2013) was performed on the number, relative abundance, relative density, and GC content of the SSRs of each cp genome, of its IGS and LSC regions, and of the mononucleotide SSRs of the cp genomes, with the Euclidean distance as the measure. The number of repetitions was 10,000.
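The core of a Ward-linkage clustering on Euclidean distances can be sketched in plain Python as follows. This is a simplified illustration, not the R analysis used in the study: it omits the 10,000 repetitions used to assess cluster support, does not standardize the features (whether the original analysis did so is not stated), and uses placeholder feature vectors chosen within the reported ranges rather than the actual per-species values. The function name `ward_clusters` and the sample labels are hypothetical.

```python
def ward_clusters(points, labels, n_clusters):
    """Agglomerative clustering with Ward linkage via the Lance-Williams update."""
    clusters = {i: [labels[i]] for i in range(len(points))}
    # Squared Euclidean distances between all pairs of initial clusters
    dist = {}
    for a in range(len(points)):
        for b in range(a + 1, len(points)):
            dist[(a, b)] = sum((x - y) ** 2 for x, y in zip(points[a], points[b]))
    next_id = len(points)
    while len(clusters) > n_clusters:
        (a, b), d_ab = min(dist.items(), key=lambda kv: kv[1])
        na, nb = len(clusters[a]), len(clusters[b])
        updated = {}
        for k in clusters:
            if k in (a, b):
                continue
            nk = len(clusters[k])
            d_ak = dist[(min(a, k), max(a, k))]
            d_bk = dist[(min(b, k), max(b, k))]
            # Lance-Williams formula for Ward linkage on squared distances
            updated[(k, next_id)] = (
                (na + nk) * d_ak + (nb + nk) * d_bk - nk * d_ab
            ) / (na + nb + nk)
        dist = {key: v for key, v in dist.items() if a not in key and b not in key}
        dist.update(updated)
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return sorted(sorted(c) for c in clusters.values())


# Placeholder features: (number, relative abundance, relative density, GC content)
labels = ["Alsophila-1", "Alsophila-2", "Sphaeropteris-1", "Sphaeropteris-2",
          "Gymnosphaera-1", "Gymnosphaera-2", "Gymnosphaera-3", "Gymnosphaera-4"]
points = [(121, 0.77, 9.81, 0.18), (122, 0.78, 9.82, 0.20),
          (102, 0.63, 6.70, 0.08), (102, 0.65, 8.18, 0.10),
          (61, 0.40, 4.11, 0.22), (63, 0.40, 4.40, 0.24),
          (65, 0.40, 4.80, 0.26), (67, 0.40, 5.06, 0.29)]

three_groups = ward_clusters(points, labels, 3)
```

With these placeholder inputs, cutting the tree at three clusters groups the samples by genus label, which is the pattern the clustering analysis in the text reports.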

| Genome structures and characteristics
The cp genomes of all eight Cyatheaceae species are double-stranded, closed, circular molecules with a typical tetrad structure (with G. denticulata as an example, as shown in Figure 1), with the total GC content ranging from 40.3% to 41.9% (Table 1).
Only LSC, SSC, and one IR were analyzed. The cp genome of each Cyatheaceae species contained 117 genes, which encoded 85 proteins, four rRNAs, and 28 tRNAs. Pseudogenes (ycf66, trnT-UGU) are also present in these genomes. Among these genes, 13 are located in the IR region. The ndhB gene spans the LSC and IRA regions, and a duplicated exon 2 sequence of the ndhB gene is present near the boundary of the IRB. Twelve genes have one intron, and three genes (ycf3, clpP, and rps12) have two introns.

| Analysis of the characteristics of SSRs
The number, relative abundance, relative density, and GC content of SSRs in the cp genomes of all eight Cyatheaceae species were systematically compared (Table 2). The values were similar within each of three groups and were not proportional to the sizes of the genomes: A. spinulosa and A. costularis (number: 121-122; relative abundance: 0.77-0.78/kb; relative density: 9.81-9.82 bp/kb; GC content: 0.18-0.20); S. brunoniana and S. lepifera (number: 102; relative abundance: 0.63-0.65/kb; relative density: 6.70-8.18 bp/kb; GC content: 0.08-0.10); and G. denticulata, G. podophylla, G. metteniana, and G. gigantea (number: 61-67; relative abundance: 0.40/kb; relative density: 4.11-5.06 bp/kb; GC content: 0.22-0.29). When Gymnosphaera was considered an independent taxonomic unit at the genus level, these three groups corresponded to the genera Alsophila, Sphaeropteris, and Gymnosphaera, respectively, which indicated that in the phylogenetic context of the three genera, the characteristics of SSRs are genus specific at the level of the genome. The number, relative abundance, and relative density of SSRs were the highest in Alsophila and the lowest in Gymnosphaera.
The highest GC content was found in Gymnosphaera, and the lowest was found in Sphaeropteris (Appendix Tables S9 and S10). For the distributions of SSRs on the genome, the SSRs located in LSC (58.46%-73.52%) were most enriched in each species, followed by SSC (12.65%-21.08%), and the least in IR (9.84%-16.53%). The number of SSRs in LSC and IR was highest in Alsophila (81-82 and 20, respectively) and lowest in Gymnosphaera (42-45 and 8-10, respectively). Furthermore, SSRs in the IGS, intron, CDS, and rRNA gene regions accounted for 75.5%-86.2%, 13.7%-20.6%, 2.0%-3.9%, and 2.0% of the SSRs, respectively (pseudogenes were treated as IGS regions). Among them, SSRs in CDS regions were detected only in the cp genomes of Sphaeropteris and Alsophila, and SSRs were detected in the rRNA genes of the cp genomes of Sphaeropteris. The SSRs in IGS regions were most enriched in Alsophila (100-101) and the least in Gymnosphaera (49-56). The number of SSRs located in the intron regions was 18 in Alsophila and 9-13 in Gymnosphaera. These results showed that in the phylogenetic context dividing the eight Cyatheaceae species into three genera, different taxa had different patterns of SSR characteristics in the cp genome and its different regions; namely, the SSR characteristics of the cp genomes of the eight Cyatheaceae species were consistent with their phylogenetic relationship.
FIGURE 1 Gene map of the cp genome of Gymnosphaera denticulata. Genes located on the outside of the outer circle are transcribed in the counterclockwise direction, whereas those on the inside of the circle are transcribed in the clockwise direction. Color codes represent different functional gene groups. In the middle circle, the GC and AT content variations are indicated by darker and lighter gray, respectively.

| Analysis of the types and characteristics of SSRs of different nucleotide numbers
The proportions of mono-, di-, tri-, tetra-, and pentanucleotide SSRs in each species were 62.5%-78.0%, 10.6%-15.6%, 0%-3.3%, 9.0%-18.5%, and 0%-1.5%, respectively. The number of mononucleotide SSRs showed obvious genus specificity: Gymnosphaera (40-47), Sphaeropteris (77-80), and Alsophila (95-96). No hexanucleotide SSRs were detected. Among mononucleotide repeats, A/T motifs were more common, and the dinucleotide repeats were dominated by AT/TA motifs. There were more tetranucleotide SSRs than tri- and pentanucleotide SSRs. Trinucleotide SSRs were absent in Sphaeropteris, and pentanucleotide SSRs were absent in Sphaeropteris, Alsophila, and G. gigantea. Within each of the three genera, the mono-, di-, tri-, tetra-, and pentanucleotide SSRs of the cp genomes had similar numbers, relative abundance, relative density, and GC content at the level of the genome and in the specific regions of the genome (LSC, SSC, and IRs; IGS, intron, CDS, and rRNA gene regions), which was especially true for mononucleotide and dinucleotide SSRs (Table 3; Appendix Tables S9 and S10).
The number, relative abundance, relative density, and GC content of SSRs of different unit lengths in the cp genome and its different regions had genus specificity in the phylogenetic context of dividing the eight Cyatheaceae species into three genera. In addition, the number, relative abundance, and relative density of SSRs of different base types in the cp genomes of the three genera also had genus specificity, which was especially true for mono- and dinucleotide SSRs (Figure 3; Appendix Table S9). Alsophila had the highest A/T and C/G motif content (77 and 18-19, respectively), Gymnosphaera had the least A/T motif content (26-32), and Sphaeropteris had the least C/G motif content (6).

| Phylogenetic analysis
The cp genomes of the eight species of Cyatheaceae were com[...] were clustered into one branch, which was located at the base of the phylogenetic tree, indicating that this group diverged earlier in this family. G. denticulata, G. podophylla, G. gigantea, and G. metteniana were clustered into one branch, which was located inside the branch of S. brunoniana and S. lepifera and was a sister group of the branch formed by A. spinulosa and A. costularis. When Gymnosphaera was included in the genus Alsophila and two genera were defined, only the difference in the GC content was significant (Table 4).

| Characteristics of the cpSSRs of the eight Cyatheaceae species
The cp genomes of the eight Cyatheaceae species are highly conserved and are similar in structure and gene content (117 genes).
The types and order of genes are the same. The lower distribution of SSRs (Figure 2a) in the IR region may be related to the higher mismatch repair rate and lower mutation rate in the IR region (Ellegren, 2004; Li et al., 2016). The lower GC content (Tables 2 and 3) of SSRs may be associated with the tendency of GC-rich regions toward AT mutations (Kuang et al., 2011; Ren et al., 2007). SSRs are mainly located in intergenic and noncoding regions, with a few present in exons (Li et al., 2004; Su et al., 2018), and the results of this study are consistent with this information. This phenomenon is related to negative selection against frameshift mutations in coding regions (Metzgar et al., 2000). CpSSRs are characterized by high [...] (Chmielewski et al., 2015; George et al., 2015); therefore, their sequences can be used to effectively classify lower taxonomic ranks, closely related groups, and plant subspecies.
In this study, we used MISA to scan the recently assembled Cyatheaceae cp genomes for microsatellites of 1-6 bp. The cp genomes of all eight Cyatheaceae species were analyzed using the same bioinformatics tool and search parameters so that the results could be compared. The number, relative abundance, relative density, and GC content of the cpSSRs had similar values within each of three groups: A. spinulosa and A. costularis; S. brunoniana and S. lepifera; and G. denticulata, G. podophylla, G. metteniana, and G. gigantea.
Based on these findings, the eight Cyatheaceae species were divided into three groups, which is consistent with recent studies (Ching, 1978; Dong & Zuo, 2018; Janssen & Rakotondrainibe, 2008; Korall et al., 2007; Korall & Pryer, 2014; Smith et al., 2006) in which the eight Cyatheaceae species were divided into three genera, indicating that the characteristics of the cpSSRs showed genus specificity at the genome level in the phylogenetic context of the three genera (Figure 2, Tables 2 and 3, and Appendix Tables S9-S11).
The number, relative abundance, and relative density of the cpSSRs of Alsophila were higher than those of Gymnosphaera, while those of Sphaeropteris fell between the two. The numbers of SSRs of Alsophila, Gymnosphaera, and Sphaeropteris distributed in the LSC region were 81-82 (66.94%-67.21%), 38-45 (58.46%-68.85%), and 70-75 (67.96%-73.52%), respectively. In each species, the LSC region contained the most SSRs, followed by the SSC region, and the IR region contained the fewest (Appendix Table S10).
The number, relative abundance, relative density, and GC content of SSRs of different unit length in Cyatheaceae cp genomes were also genus specific (Table 3; Appendix Tables S9-S11), which was especially true for mononucleotide and dinucleotide SSRs (Figure 3), possibly because of the lower content of SSRs of other unit lengths.
The number of mononucleotide repeats of Alsophila, Gymnosphaera, [...] Tables S9-S11). Mononucleotide SSRs, which exist in large numbers in cp genomes (George et al., 2015; Liang et al., 2019), are the most abundant (Table 3), and A/T motifs are the most common (Figure 3; Appendix Table S9). This finding is similar to previously reported patterns of land plants (George et al., 2015; Huang et al., 2021; Ren et al., 2021; Vieira et al., 2014; Wei et al., 2021). However, in Polypodiaceae plastomes, most repeats were C/G mononucleotides. This increase in GC content may be related to the adaptation mechanism of Polypodiaceae to the environment (Gao et al., 2018; Liu et al., 2021). These results indicate that SSRs in cp genomes can reflect genetic variation between different taxa.
TABLE 4 Significance test of the number, relative abundance, relative density, and GC content of SSRs in the whole cp genome, IGS, and LSC and mononucleotide SSRs in the whole cp genomes of the eight Cyatheaceae species
The distribution of different repeat types (from mononucleotide to hexanucleotide) of motifs in coding and noncoding regions, introns, and intergenic regions displayed a high degree of genus specificity (Appendix Tables S10 and S11), which can be partially explained by the interaction of mutation mechanisms and differential selection (Toth et al., 2000). The most common mutation mechanism affecting SSRs is slipped replication. Other mechanisms, such as unequal crossing over, nucleotide substitution, and duplication events, are also responsible for SSR variation (Hancock, 1999;Schlotterer & Tautz, 1992).
The SSRs of different groups of genomes have specific distribution patterns, which are related to their common ancestors. Evolutionary trends have been linked to the inclusion of SSRs, which may have been preserved because of their ability to adapt to novel regulatory mechanisms (Srivastava et al., 2019). Analysis of the characteristics of SSRs provides useful clues for the phylogenetic study of Cyatheaceae and facilitates an understanding of the evolution of SSRs in plant genomes. Dong and Zuo (2018) pointed out that Gymnosphaera and Alsophila were significantly different in morphological traits such as petiole color, the presence or absence of degenerated pinnae at the base, [...] (Dong & Zuo, 2018; Janssen & Rakotondrainibe, 2008; Korall & Pryer, 2014; Wang et al., 2003).
FIGURE 5 Clustering analysis of eight Cyatheaceae species based on the number, relative abundance, relative density, and GC content of SSRs across the whole chloroplast genome (a), IGS (b), LSC (c), and mononucleotide (d) SSRs in the chloroplast genomes
Comparative analysis of cpSSRs from broad plant groups may be useful for a better understanding of the diversity and evolutionary trends of cp genomes (George et al., 2015). This study indicates that the distribution characteristics of cpSSRs of Cyatheaceae can provide useful phylogenetic information at the genus level. Relatively few studies have explored phylogenetic relationships in ferns by analyzing cpSSRs. Since the software programs that identify SSRs are limited by their efficiency and parameter settings and may also be affected by the quality of the SSR dataset generated, their accuracy requires improvement (Ellegren, 2004; Lim et al., 2013). Given the limitations of the current plant genome sequences, we did not analyze the large-scale SSR characteristics of the cp genomes of Cyatheaceae, nor did we obtain the plant materials of Cyathea that are generally distributed in South America. However, this study is based on the existing chloroplast genomes of Cyatheaceae, and the eight species encompass most Cyatheaceae genera. Our results demonstrate that the distribution characteristics of the cpSSRs of the existing Cyatheaceae are genus specific. Our aim was not to solve the phylogenetic problem of Cyatheaceae but to identify traits that can be used for phylogenetics. This study provides a new basis for the classification of Cyatheaceae at the species and genus levels, thus advancing the phylogenetic study of Cyatheaceae. In the future, more genomic and transcriptomic data are needed to validate these results.

| CONCLUSION
The cp genomes of the eight Cyatheaceae species have the same gene types in the same order and similar structure and gene content.
The distribution characteristics of the cpSSRs of the eight species are consistent with the recent classification of Cyatheaceae, which divides the eight Cyatheaceae species into three genera, indicating that in the phylogenetic context of the three genera, the distribution characteristics of SSRs in their cp genomes are genus specific, which may be a general rule among Cyatheaceae. Analyzing the characteristics of SSRs provides clues and new ideas for research on the phylogeny of Cyatheaceae.

ACKNOWLEDGMENTS
We would like to thank the Wuhan Botanical Garden, South China Botanical Garden, and Fairy Lake Botanical Garden of the Chinese Academy of Sciences for providing samples of Cyatheaceae plants.

CONFLICT OF INTEREST
The authors declare no competing interests.

DATA AVAILABILITY STATEMENT
The chloroplast genomes of Gymnosphaera denticulata Baker and