Loss of the IR region in conifer plastomes: Changes in the selection pressure and substitution rate of protein‐coding genes

Abstract Plastid genomes (plastomes) have a quadripartite structure, but some species have drastically reduced or lost inverted repeat (IR) regions. IR regions are important for genome stability and the evolution rate. In the evolutionary process of gymnosperms, the typical IRs of conifers were lost, possibly affecting the evolutionary rate and selection pressure of genomic protein‐coding genes. In this study, we selected 78 gymnosperm species (51 genera, 13 families) for evolutionary analysis. The selection pressure analysis results showed that negative selection effects were detected in all 50 common genes. Among them, six genes in conifers had higher ω values than non‐conifers, and 12 genes had lower ω values. The evolutionary rate analysis results showed that 9 of 50 common genes differed between conifers and non‐conifers. It is more obvious that in non‐conifers, the rates of psbA (trst, trsv, ratio, dN, dS, and ω) were 2.6‐ to 3.1‐fold of conifers. In conifers, trsv, ratio, dN, dS, and ω of ycf2 were 1.2‐ to 3.6‐fold of non‐conifers. In addition, the evolution rate of ycf2 in the IR was significantly reduced. psbA is undergoing dynamic change, with an abnormally high evolution rate as a small portion of it enters the IR region. Although conifers have lost the typical IR regions, we detected no change in the substitution rate or selection pressure of most protein‐coding genes due to gene function, plant habitat, or newly acquired IRs.


| INTRODUC TI ON
The plastid genomes (plastomes) of most land plants have a highly conserved quadripartite structure, consisting of a large single-copy (LSC) region and a small single-copy (SSC) region separated by a pair of inverted repeat (IR) regions (Wicke et al., 2011). Due to selective pressure on photosynthesis-related elements, plastids have highly conserved gene content and order (Ruhlman & Jansen, 2014). And due to their highly conservation, a large copy number, lack of recombination, and uniparental inheritance, plastomes have been used to evaluate phylogeographic relationships, phylogeographic histories, and evolutionary events (Barrett et al., 2016;Julian et al., 2017;Moore et al., 2006;Pacheco et al., 2019;Shaw et al., 2014;Wang et al., 2020).
The sequence substitution rate in the IR region differs from that in the SC region. In some legumes with missing IR regions, the synonymous substitution rate of genes entering the SC region from the IR region is similar to that of those already in the SC region (Perry & Wolfe, 2002). The IR region has a low substitution rate in Cycas (Wu & Chaw, 2015). Zhu et al. (2016) found that the synonymous substitution rate of genes in the SC region in vascular plants (angiosperms, gymnosperms, and ferns) is 3.7-fold that of genes in the IR region; after the transfer of genes between the SC and IR regions, the substitution rate becomes consistent. Some genes in the IR region have high synonymous substitution rates in some species (Pelargonium, Plantago, and Silene). In ferns, psbA, rps7, 3'-rps12, and ycf2 showed a reduced substitution rate and increased GC content after entering the IR region (Li, Kuo, et al., 2016). Recently, ,  reported that the substitution rate of 3'-rps12 in the IR region is lower than that of 5'-rps12 in the SC region in ferns and gymnosperms.
The IR region has important effects on genome stability and the evolutionary rate (Maréchal & Brisson, 2010;Palmer & Thompson, 1982;Wolfe et al., 1987). During gymnosperm evolution, conifers lost the IR region. Araucaria cunninghamii (Araucariaceae) is native to southeastern coastal areas of Oceania and is the main afforestation tree species in tropical and subtropical regions. It is one of the top five park tree species worldwide and an important garden ornamental plant, and it is cultivated widely in China. Callitropsis funebris (Cupressaceae) is endemic to China and distributed widely; it grows rapidly and has a wide range of uses and strong adaptability. Its phylogenetic classification is a focus of research (Zheng & Fu, 1978). As these two species are important representatives of their genera, we sequenced their plastomes. In this study, we selected 78 gymnosperms and 2 Polypodiales (outgroup) and analyzed selection pressure and the evolutionary rate using the maximum likelihood method. We investigated whether the selection pressure and evolutionary rate of protein-coding genes in conifers differed according to the presence of the IR region; whether the selection pressure and evolutionary rate of genes that enter the IR region differ from those that enter the SC region; and the heterogeneity of gymnosperm evolutionary rates.
In addition, 76 gymnosperms and 2 Polypodiales (as outgroups) were downloaded from the NCBI database. A total of 78 gymnosperms cover 13 families and 51 genera (Table 1). Genious prime 2020  (Kearse et al., 2012) software was used to extract 50 protein-coding genes. And performed sequence alignment and correction through the ClustalW (codons) module in MEGA X (Kumar et al., 2018).

| Analysis of phylogenetic relationships
We constructed the phylogenetic relationships based on the common gene dataset (the two Polypodiales as the outgroup). MEGA X and PAUP4.0 (Swofford, 2002) were used to construct the NJ (neighbor-joining) tree and MP (maximum-parsimony) tree, respectively. RaxmlGUI2 software built ML (maximum-likelihood) tree with GTRGAMMAI substitution model and 1,000 bootstrap (Stamatakis, 2014). BI (Bayesian inference) tree was built with MrBayes 3.2.6 software (Huelsenbeck & Ronquist, 2001). We reconstructed a background tree combined with the published phylogenetic relationship to analyze selection pressure and evolution rate, finally.

| Analysis of selection pressure
Codeml program in PAML4.9 was used to analyze the selection pressure (Yang, 2004). The M0 (one-ration) model assumes that each branch has the same value of ω under the Branch model. The Model 2 (two-ratio) assumes the foreground branch and the background branch have different ω values. The likelihood ratio test of M0 and Model 2 can be used to detect the difference of selection pressure between the foreground branch and the background branch.

| Analysis of evolutionary rate
We used HyPhy 2.2.4 software (Pond et al., 2005)

| Plastome characteristics
The 76 gymnosperms were divided into two types according to the presence of a typical IR region: 56 gymnosperms lacked a typical IR region (SC-56; conifer) and 22 gymnosperms had a typical IR region (gnetophytes, cycads, and Ginkgo biloba; IR-22; non-conifer).
For SC-56 (Appendix S1), the genome size ranged from 117,720 bp Compared with other groups, the plastome of the gnetophytes was the smallest, and the proportion of SSC region was also the smallest (7.2-9.3%), but the proportion of the IR region was largest (16%-18.9%). The three groups all contained the protein-coding genes rps7 and 3'-rps12 in the IR region, and ndhB was present in G. biloba and cycads. All plants, except G. biloba, possessed ycf2. Individually, 3'-psbA was present in Welwitschia mirabilis, Gnetum parvifolium, and Gnetum gnemon, rpl32 was detected in W. mirabilis, and rps15 was detected in E. equisetina ( Figure 1).
Rank sum tests indicated no significant differences in genome size (p = .202) and significant differences in GC content (p < .01) between IR-22 and SC-56.

| Phylogenetic analysis
The 50 protein-coding genes comprised 30 photosynthetic system genes, 15 genetic systems, and 5 other genes ( Table 3). The constructed phylogenetic results (Appendix S2) showed that the phylogenetic relationship constructed by the NJ method was the clearest, and each group formed a monophyletic branch. Using the BI and ML methods, we found that all groups, except the Pinales, formed a monophyletic branch. In the MP tree, the relationship between Cupressales and Araucariales was unclear.
The NJ (bootstrap value = 100) and MP (bootstrap value = 50) trees supported gnetophytes as the basic group of gymnosperms, and the MP (bootstrap value = 100) and BI (posterior probability = 1) trees supported gnetophytes and P. neoveitchii as sister groups. Based on previous reports, we accepted the Gnepine hypothesis, and performed a manual adjustment to obtain the phylogenetic tree for selection pressure and evolution rate analysis ( Figure 2).

| Evolutionary analysis based on the presence of a typical IR region
With IR-22 as the foreground branch and SC-56 as the background branch, likelihood ratio tests of M0 and Model 2 yielded 18 genes with significant differences (p < .05). That is, these 18 genes experience different selection pressure in IR-22 and SC-56. Among them, six genes had an ω value in IR-22 that were lower than SC-56, and the rest were higher (Table 4). We analyzed and tested the evolutionary rates of the 50 selected common genes (Appendix S3). Among these, we found that the related parameters of nine genes were significantly different between IR-22 and SC-56 (p < .05; Figure 3 F I G U R E 1 Comparison of IR regions between different taxa. Different color blocks represent different gene types. In species with typical IR regions, there were two protein-coding genes (rps7 and 3'-rps12) located in the IR region. In Ginkgo biloba, the ycf2 was located in the LSC region due to contraction of the IR region. Gnetophytes lost ndh genes, and in Welwitschia mirabilis, 3'-psbA entered IR region

| Evolutionary analysis based on genes entering the IR region
The IR regions of 22 species contained the protein-coding genes 3'-rps12 and rps7 (Figure 1). In W. mirabilis, G. parvifolium, and G. gnemon, part of the psbA (3'-psbA) entered the IR region; thus, it was divided into two categories (IR-3 and SC-75). In G. biloba, ycf2 has been left from the IR region; thus, it was divided into two categories (IR-21 and SC-57).
Likelihood ratio test results revealed that the selection pressure differed among categories. For psbA, ω IR-3 was 0.03474, with IR-3 as the foreground branch, and ω SC-75 was 0.0001, with SC-75 as the background branch. For ycf2, ω IR-21 was 0.76124, with IR-21 as the foreground branch, and ω SC-57 was 0.67832, with SC-57 as the background branch. The trsv, ratio, and ω values of psbA were significantly higher (P trsv = .048, P ratio = .003, and P ω = .027) in the IR region than in the SC region, and the evolutionary rate of ycf2 was significantly lower (p < .05) in the IR region than in the SC region ( Figure 4).

| Changes in genome size
Evolution has decreased the gymnosperm genome size. Conifers and non-conifers have genomes of similar sizes, which is related to the large compact genome of gnetophytes (115,022-109,518 bp).
This compact genome may increase the gnetophyte survival rate in harsh and competitive environments (Wu et al., 2009). After the removal of gnetophytes, a significant difference was detected between groups (p < .01). The shrinkage of the conifer plastome is a result mainly of the loss of the IR region (IRb in Pinaceae and IRa in cupressophytes). In addition, the reduction in the cupressophyte plastome size may be caused by intergenic shrinkage resulting from the mutational burden (Wu & Chaw, 2014). Pinaceae and cupressophytes have lost their typical IRs during evolution, but have acquired one or more short, novel IRs, some of which also exhibit recombination to generate genomic structural diversity (Guo et al., 2014).
Compared with that of ginkgo, the IR region of other nonconifers has expanded. Although the IR region in gnetophytes is smaller than that in cycads, it accounts for a large proportion of the genome and contains more genes. Gnetophytes lost the ndh gene and non-coding sequence, resulting in a compact, more economical genome (Wu et al., 2009). In Pinaceae, the IR comprises only the full trnI-CAU gene and, in most species, a portion of the psbA gene (Lin et al., 2010;. The GC content was increased in non-conifers, which may be related to a GC bias in gene conversion in the IR region. Indeed, the IR region of cycads contains more A/T to GC substitutions (Wu & Chaw, 2015).

| Phylogenetic location of gnetophytes
The four phylogenetic trees constructed using a tandem dataset of 50 common protein-coding genes were not consistent, and the application of different algorithms yielded different topological structures. Compared with those produced by NJ and MP, the phylogenetic relationships constructed by ML and BI are more accurate (Hall, 2005). The ML and BI trees showed that gnetophytes and P. neoveitchi were sister groups with a high degree of support.
Although Pinaceae was a non-monophyletic group, it was closest to the gnetophytes, supporting the Gnepine hypothesis.
The phylogenetic position of the gnetophytes has been debated (Mathews, 2009;Palmer et al., 2004). Burleigh and Mathews (2004) examined 13 genes in gymnosperms (five plastid, four nuclear, and four mitochondrial) and almost 19 000 nucleotides, providing evidence for the Gnepine hypothesis. Wu and Chaw (2014)  with 1308 loci based on transcriptome data, which also supported the Gnepine hypothesis.  constructed ML and BI trees using rbcL and matK, which supported the classification of gnetophytes and Pinaceae as sister groups.

| Heterogeneity of evolutionary rates
The change in evolutionary rate was independent of the presence of the IR region, but rather indicates an increased substitution rate of gnetophytes (especially W. mirabilis) and a lower evolutionary rate of conifers. The evolutionary rate of four genes was increased in nonconifers. The increased substitution rates of rps8, psbE, and psbA were related to the low evolutionary rate of conifers. The petG value was abnormally high. The dN or dS value for most species branches was 0, resulting in false-positive results. By contrast, the substitution rates of rps4 and ycf2 were decreased in non-conifers, which is related to their generally low evolutionary rate. The evolutionary rate of most genes in W. mirabilis was high. The phylogenetic tree constructed using 51 common genes showed that the gnetophytes, especially W. mirabilis and Ephedra sinica, had high evolutionary rates. McCoy et al. (2008) reported that about 75% of plastid protein-coding genes showed high substitution rates in W. mirabilis. The substitution rates of most protein-coding genes did not differ significantly between conifers and non-conifers, which may be related to the newly acquired short IR region of conifers (Hirao et al., 2008;Wu & Chaw, 2014;Yi et al., 2013). In Cephalotaxus oliveri, the 544-bp IR (repeated trnQ-UUG gene) was inferred to have recombination activity (Yi et al., 2013). Gnetophytes have a higher substitution rate than do other gymnosperms (Ran et al., 2018;Wang et al., 2015;Wu et al., 2009), possibly because of higher mutation rates, changes in selective pressure, increased fixation of mutations by genetic drift, and biological characteristics (Fry & Wernegreen, 2005;Lanfear et al., 2010). Wu et al. (2009) proposed that the high frequency of AT-rich codons in gnetophytes leads to a higher substitution rate. Wang et al. (2015) proposed that the generation time and plant height underlie the increased substitution rate of gnetophytes (Lanfear et al., 2010(Lanfear et al., , 2013. Ran et al. (2018) found that gnetophytes and angiosperms have similar rates of molecular evolution, which are higher than those of other gymnosperms, suggesting that gnetophytes and angiosperms experienced similar selection pressure during their evolutionary histories.
As conifers are typically taller than non-conifers, the long-term rate of mitosis in the apical meristem is slower, the frequency of DNA replication is lower, and the accumulated errors per unit time are reduced, resulting in low mutation and substitution rates (Lanfear et al., 2013).

| Negative selection pressure
Genes in the IR region experience strong negative selection . The IR region is important for genome structural stability, and genomes lacking an IR region may experience different natural selection effects. However, the selection pressure of most genes did not differ significantly, indicating that they are not affected by the IR region. When ω < 1, smaller ω values reflect stronger negative selection pressure (Yang, 2006).
The selection pressure of 18 genes differed between conifers and non-conifers, and their ω values were <1 (Table 4). Compared with conifers, six genes (four PSII genes, one cytochrome gene, and one c-type cytochrome synthesis gene) showed lower ω values in non-conifers, and 12 genes (three PSI genes, one PSII gene, six genetic system genes, and two other genes) showed higher ω values in non-conifers. The lower ω values of these genes indicates the greater the negative selection effect, which may be more conducive to their function in various groups. The differences in selection pressure between conifers and non-conifers may be related to plant height; non-conifers are typically shorter than conifers.
Efficient photosynthesis is required to obtain enough sunlight, necessitating negative selection pressure on the genes encoding the photosynthetic machinery. However, some such genes in conifers experienced negative selection pressure, which may be related to their structure or function or to habitat differences. The negative selection exerted by the habitats of gnetophytes leads to genespecific reductions in dN/dS in their plastomes and genomes (Wang et al., 2015).

| Reduced substitution rate in the IR region
Genes in the IR region tend to have a low substitution rate (Li, Kuo, et al., 2016;Li, Gao, et al., 2016;. In this study, ycf2 in the IR region showed a reduced substitution rate (Figure 4)  and ω of petG. Compared with IR-22, the genes with higher evolution rates in the SC-56 were trsv, ratio, dN, dS, and ω of ycf2; trsv of rps4; and ratios of rpoC1, rbcL, and cemA F I G U R E 4 The evolutionary rate of psbA and ycf2 in the different categories.
(a) For psbA, the mean evolutionary rate of IR-3 and SC-75. B: For ycf2, the mean evolutionary rate of IR-21 and SC-57. *Rank sum test, p < .05. The trsv, ratio, and ω values of psbA were significantly higher in the IR region than in the SC region, and the evolutionary rate of ycf2 was significantly lower in the IR region than in the SC region IR region in three gymnosperm species and shows increased substitution rates ( Figure 4) and weak selection pressure. These characteristics differ from those of other genes; the most reasonable explanation is that this gene is undergoing dynamic changes in gymnosperms, resulting in changes in its substitution rate and selection pressure.
IR regions may play an important role in the maintenance of genomic stability (Maréchal & Brisson, 2010), such as by replication initiation (Heinhorst & Cannon, 1993), genome stabilization (Palmer & Thompson, 1982), and gene conservation (Palmer & Thompson, 1982;Wolfe et al., 1987). IR loss is related to structural recombination. In some taxa lacking an IR region, plastids underwent significant structural changes, including gene and intron loss, multiple inversion, and translocation and duplication of plastome segments (Jin et al., 2020;Hirao et al., 2008;Mower & Vickrey, 2018). Sequences in the IR region often have a reduced substitution rate, which is related to the region's characteristics. Due to its double-copy nature, the frequency of gene conversion is higher, and some mutation sites are repaired, resulting in a low substitution rate (Birky & Walsh, 1992;Khakhlova & Bock, 2006).

ACK N OWLED G EM ENTS
We thank the National Natural Science Foundation for its support (31670200, 31770587, 31872670, and 32071781).

CO N FLI C T O F I NTE R E S T
The authors declare no conflict of interest.

DATA AVA I L A B I L I T Y S TAT E M E N T
Data source is NCBI database: