Ancestral gene duplications in mosses characterized by integrated phylogenomic analyses

Mosses (Bryophyta) are a key group occupying an important phylogenetic position in land plant (embryophyte) evolution. The class Bryopsida represents the most diversified lineage, containing more than 95% of modern mosses, whereas other classes are species‐poor. Two branches with large numbers of gene duplications were elucidated by phylogenomic analyses, one in the ancestry of all mosses and another before the separation of the Bryopsida, Polytrichopsida, and Tetraphidopsida. The analysis of the phylogenetic progression of duplicated paralogs retained on genomic syntenic regions in the Physcomitrella patens genome confirmed that the whole‐genome duplication events WGD1 and WGD2 were re‐recognized as the ψ event and the Funarioideae duplication event, respectively. The ψ polyploidy event was tightly associated with the early diversification of Bryopsida, in the ancestor of Bryidae, Dicranidae, Timmiidae, and Funariidae. Together, four branches with large numbers of gene duplications were unveiled in the evolutionary past of P. patens. Gene retention patterns following the four large‐scale duplications in different moss lineages were analyzed and discussed. Recurrent significant retention of stress‐related genes may have contributed to their adaption to distinct ecological environments and the evolutionary success of this early‐diverging land plant lineage.


Introduction
Gene duplications provided the genetic raw materials for evolutionary innovation and are considered important driving forces in diversification and evolution (Ohno, 1970;Van de Peer et al., 2017), and the evolutionary past of land plants is characterized by recurrent ancestral polyploidy events (Blanc & Wolfe, 2004;Cui et al., 2006;Van de Peer et al., 2017). The duplicated genes in Arabidopsis can be traced back to five rounds of polyploidy, the α-β-γ-ε-ξ series, where the ε event occurred in the ancestor of angiosperms and ξ event in the ancestor for all seed plants (Blanc et al., 2003;Bowers et al., 2003;Jiao et al., 2011); however, this is still controversial (Ruprecht et al., 2017). Echoing the α-β-γ series, whole-genome duplication (WGD) series, termed ρ-σ-τ, was reported in grasses (Jiao et al., 2014). The gamma event appears associated with the species radiation of more than 75% of angiosperms (Friedman, 2009;Jiao et al., 2012;Vekemans et al., 2012). Large-scale transcriptomic analyses also revealed that hundreds of WGDs may have occurred in the evolutionary past of different green plant lineages (Leebens-Mack et al., 2019). However, both the phylogenetic relationships and the evolutionary history of mosses, an important clade in early-diverging land plant lineages (Morris et al., 2018) with over 12 000 species (Goffinet & Buck, 2004;Shaw et al., 2011), are less investigated.
A polyploidy event, characterized by a K S peak around the chromosome-scale assembly of the P. patens genome deciphered the event as two recurrent WGDs: the younger WGD2 with a K S peak around 0.5-0.65 and the older WGD1 with a K S peak around 0. 75-0.9 (Lang et al., 2018). Paralogous K S frequency distributions in a number of moss transcriptomes indicated widespread ancestral genomic duplications in mosses (Lang et al., 2018). Multiple genome duplications in peat mosses (class Sphagnopsida) and one ancestral polyploidy event that occurred in the ancestor of mosses were also reported (Devos et al., 2016). High-throughput sequencing technologies generated adequate transcriptome data for discovery of ancestral largescale genome duplications by analyzing K S -based age distributions (Blanc & Wolfe, 2004;Cui et al., 2006) and/or large-scale gene tree-based phylogenomic approaches (Jiao et al., 2011;Jiao et al., 2012;Yang et al., 2018). For mosses, transcriptomes for 11 moss species were generated in the pilot study for the 1000 Green Plant Transcriptome Project (1KP; https://www.onekp.com; Wickett et al., 2014), and the transcriptomes of Syntrichia caninervis and Bryum argenteum have been reported in our previous studies (Gao et al., 2014;Gao et al., 2017). Recently, a phylotranscriptomic study in pleurocarpous mosses (Hypnales) generated a large set of transcriptomic data . Despite the rapid growing body of sequencing data in mosses, the precise phylogenetic positioning of large-scale genome duplication events remains unresolved in Bryopsida, the largest class of mosses (Newton et al., 2006). The detection of these ancestral duplication events has been hampered by the unresolved phylogenetic relationships of the Bryopsida with other classes of mosses; however, phylogenetic relationships have been proposed (Shaw et al., 2010a(Shaw et al., , 2010b(Shaw et al., , 2011. The P. patens chromosome structure and intragenomic synteny comparison provided unequivocal evidence for two recurrent ancestral WGDs (Lang et al., 2018), but the precise phylogenetic position for the two events was unresolved. Whether or not the WGDs were shared with other mosses or simply species-specific polyploidy events remained an open question. Although intergenome synteny comparison remains unavailable at present due to the lack of a closely related moss genome, resolving the phylogenetic placement of the two P. patens WGDs could be achieved by investigating the phylogenetic timing of the duplicated syntelog pairs. For example, the precise phylogenetic timing of the core-eudicot gamma triplication event was successfully resolved by tracing and summarizing the phylogenetic timing of Vitis syntelogs (Jiao et al., 2012).
To resolve the phylogeny of the mosses and phylogenetic positioning of ancestral genomic duplication events, we used phylotranscriptomic analyses of augmented transcriptomic data to generate reliable species phylogenies (Wickett et al., 2014;Ran et al., 2018;Unruh et al., 2018). This was coupled with a large-scale phylogenomic analysis investigating duplication signals from thousands of gene trees to determine the phylogenetic positioning of ancestral genomic duplication events to avoid the negative effects of "L shaped" background duplications that impact paranomebased K S frequency analyses (Jiao et al., 2011;Devos et al., 2016;Johnson et al., 2016).

Clustering transcripts into homolog clusters
To cluster transcripts into homolog clusters, all-by-all pair-wise bidirectional BlastP (Buchfink et al., 2015) comparisons (e-value ≤ 1e-5) were conducted among the 29 transcriptomes. Only alignments with more than 50% coverage were introduced into the OrthoFinder v2.2.6 pipeline (Emms & Kelly, 2015). Homolog clusters were required to have at least one sequence from each of the three genomes (P. patens, M. polymorpha, and S. moellendorffii); homolog clusters with less than 10 taxa were discarded and 5684 homolog clusters were retained.
Protein sequences in each homolog cluster were aligned using MAFFT v7.310 (Katoh & Standley, 2013), employing the L-INS-i strategy and 5000 iterations. Coding regions were forced on the protein sequence alignment using Phyutility v2.2.6 and trimmed by a minimal column occupancy of 0.1 (Smith & Dunn, 2008). Gene trees for each homolog cluster were constructed using FastTree v2.1.10 (Price et al., 2010) with 1000 bootstrap replicates. At this stage (Fig. S1), gene clusters may include alternatively spliced isoforms and/or recent paralogs (high sequence identity). The Yang and Smith pipeline (Yang & Smith, 2014) was employed to inspect monophyletic or paraphyletic clades in each tree consisting of genes from a single taxon, and these clades were reduced to retain the longest unambiguous alignment, thus masking the transcript isoforms and recently duplicated paralogs in each transcriptome. The original protein sequences in each gene cluster (before trimming) were realigned using MAFFT v7.310 (Katoh & Standley, 2013). The alignments were trimmed and back-translated using TRIMAL v1.4 (Capella-Gutierrez et al., 2009). The generated CDS alignments for homolog clusters were divided into two subsets: one containing the putatively one-to-one orthogroups (one gene from each species) for the species tree estimation and the other containing multicopy homolog clusters, to identify ancestral gene duplications.

Species tree construction
Homolog clusters were masked to maximize the number of one-to-one orthogroups (Table S2) and utilized for species tree estimation using both concatenated sequence supermatrix and coalescence-based summary methods Unruh et al., 2018). However, we recognize that some ancient paralogs may be misidentified as orthologs due to lineage or species-specific nonorthologous gene losses (Robertson et al., 2017). Reduced coding sequence alignments for the 649 one-to-one orthogroups were concatenated and gaps were inserted where taxa were missing. Three partition strategies were employed: (i) original concatenated alignment unpartitioned; (ii) one partition for the first and second codon positions, and a separate partition for the third codon position; and (iii) a recommended partition scheme (PartitionFinder v2.1.1;Lanfear et al., 2017) that treated each codon position as an individual partition. The best-fit partitioning scheme divided concatenated alignment into 611 partitions. The concatenated species tree was estimated using RAxML v8.2.12 (Stamatakis, 2014) under the GTRGAMMA evolutionary model and 500 bootstrap replicates with S. moellendorffii as the outgroup. To validate the species tree, we constructed trees using the first and second codon positions and the concatenated protein alignments under the PROTGAMMAAUTO model.
For the coalescence-based summary approach, individual maximum-likelihood (ML) unrooted gene trees for CDS alignments were estimated for each of the 649 single-copy orthogroups using RAxML v8.2.12 (Stamatakis, 2014) under the GTRGAMMA model with 500 bootstrap replicates. The ASTRAL-III v5.6.2 algorithm  was used to estimate the coalescence-based species tree from individual gene trees, and support values were calculated from 100 replicates of multi-locus bootstrapping.

Phylogenetic placement of ancestral gene duplication events
The 5035 multicopy homolog clusters were employed for the identification of ancestral gene duplications. Maximumlikelihood gene trees were constructed for each homolog clusters using RAxML v8.2.12 under the GTRGAMMA model with 100 rapid bootstraps and with the gene sequences from Selaginella or Marchantia as outgroups.
The Phylogenetic Placement of Polyploidy Using the Genomes (PUG) algorithm (McKain et al., 2016;Unruh et al., 2018) was employed to identify paleopolyploidy events using the species tree generated from the concatenated sequence supermatrix as a guide. The PUG algorithm, based on species-tree to gene-tree reconciliation, was run for each of the 5035 gene trees using the "estimate_paralogs" option to count unique duplication nodes. For each paralogous pair, the coalescence node (potential duplication node) was identified from the gene family phylogeny, and species composition for both the duplication clades (subtrees) and the sister lineage was queried and scrutinized to count the duplication events for the internal nodes in the species tree. The placement was considered acceptable for the duplication of that paralogous pair only if the combined species composition of the subtree and the sister lineage resolved to the same node in the species tree (McKain et al., 2016;Unruh et al., 2018). Results from PUG were filtered for a minimal bootstrap value of 50 and 80 for each paralogous pair coalescence node.
Paralogous pairs retained on syntenic blocks (syntelogs) in the P. patens genome were identified using the McScanX algorithm (Tang et al., 2008;Wang et al., 2012) and their phylogenetic timing was traced from corresponding gene family phylogenies, similar to the strategy employed to analyze the core-eudicot GAMMA event in eudicots (Jiao et al., 2012). Specifically, the P. patens syntelog pairs were input as paralogs file in PUG to obtain their coalescence nodes in corresponding phylogenetic trees.

Synonymous divergence estimation of duplicated genes
Divergence patterns for duplicated genes were explored by calculating the synonymous substitutions per synonymous site (K S , also known as D S ) for each paralogous pair. To capture the "pure" K S peak signal of ancestral duplications, the paralogous pairs with a coalescence node at focal duplication nodes were extracted from the gene trees with a minimum bootstrap support value (BSV) of 50%. Protein sequence alignments were generated (Edgar, 2004) and back-translated (Suyama et al., 2006) for each duplicated sequence pair, and then alignment gaps and stop codons were removed. K S values for each paralogous gene pair were calculated using KaKs_Calculator v2.0 software (Wang et al., 2010) with the "GY" model (Goldman & Yang, 1994). The derived paralogous K S values in each species were further analyzed using EMMIX (McLachlan et al., 1999) with one to ten components, using 100 random and 10 k-means starting values. The Bayesian Information Criterion was used to select the number of components that best fit the data.
2.6 Functional enrichment analyses for homolog groups with ancient duplications Physcomitrella patens genes were annotated using Inter-ProScan v5.34-73.0 (Jones et al., 2014) and Blast2GO v5.2.5 (Jones et al., 2014) to obtain gene ontology (GO) labels. Assuming that all genes clustered into the same homolog cluster share similar functions (Jiao et al., 2011;Ren et al., 2018), the GO labels of each P. patens gene were assigned to each of the corresponding homolog clusters. The functional enrichment analyses were subsequently performed for homolog clusters containing ancient duplications at the focal nodes. To assess the overrepresented GO categories for each homolog cluster, the Cytoscape v3.7.1 (Shannon et al., 2003) plugin BiNGO v3.0.3 (Maere et al., 2005b) was utilized to conduct a hypergeometric analysis and Benjamini and Hochberg False Discovery Rate (FDR) multiple test correction (Benjamini & Hochberg, 1995). The goslim_plant ontology vocabulary was selected as the namespace. GO terms with statistical FDR ≤ 0.05 were considered as significantly overrepresented. The statistical results were depicted with heatmaps using the pheatmap package in the R statistical environment.

Transcriptomes and homolog clusters
A total of 29 species (Fig. 1) were included for the phylogenomic analyses, including 12 de novo assembled transcriptomes (Table S1). The nonredundant protein sequences for each species (listed in Table S1) ranged from 11 140 (Rhodobryum ontariense) to 45329 (Eosphagnum inretortum). A total of 643 020 nonredundant peptide sequences from the three genomes (Physcomitrella patens, Marchantia polymorpha, and Selaginella moellendorffii) and 26 de novo transcriptomes were sorted into 31 779 homolog clusters, which reduced to 5684 by retaining only homolog clusters that included a minimum of 10 taxa and at least one sequence in each of the three reference genomes. The 5684 homolog clusters were collapsed to retain only one representative sequence for mono-and paraphylogenetic clades that contained sequences from single taxa, and terminal branches with abnormal lengths were eliminated (Fig. S1). The 649 one-to-one orthogroups were used for species tree estimation and the remaining 5035 multicopy homolog groups for phylogenetic placement of WGD events. The masking of transcript isoforms and recent paralogs did not alter the detection of WGDs (Figs. S2 and S3), and the trimmed gene family trees removed spurious terminal branches, eliminating unreliable members (Yang & Smith, 2014).
3.2 Moss phylogeny reconstructed from single-copy orthogroups Alignments from the 649 putative single-copy gene families were concatenated to generate a supermatrix that contained 833 313 sites. The contributions of each species to the singlecopy orthogroups are presented in Table S2. With the exception of the placement of Hypnum imponens and Anomodon rostratus, relative to Thuidium delicatulum within Hypnales and the clade of Polytrichopsida plus Tetraphidopsida (Fig. S4), the ML species tree produced a well-resolved moss phylogeny with BSVs of 100% for all other nodes. Although the BSVs for the three nodes in Hypnales were not 100%, all were larger than 90% in the three partition schemes (Fig. 1). The partitioning scheme recommended by PartitionFinder produced a moss phylogeny with better BSV support values for the three nodes in Hypnales nodes (BSV > 95%), supporting the phylogeny presented in Fig. 1. The Polytrichopsida plus Tetraphidopsida clade was supported when a species tree was constructed using partitioning scheme II (Fig. S5), and the concatenated protein alignment (Fig. S6) also generated the identical species tree topologies with 100% BSVs.
A coalescence-based species tree was estimated from the 649 individual gene family trees using the ASTRAL-III v5.6.2 algorithm . With the exception of the detailed deep phylogeny in Hypnales, the ASTRAL-III species tree topologies were consistent with the phylogeny reconstructed from concatenated supermatrix (Fig. S4). The Polytrichopsida plus Tetraphidopsida monophyletic clade was supported in the ASTRAL-III analysis (BSV of 100%).
Overall, the species tree topologies were consistent with the accepted moss taxonomy (Shaw et al., 2011) and with the recently resolved ordinal phylogeny of mosses (Liu et al., 2019). Takakia lepidozioides (Takakiopsida) was confidently placed as sister to all other mosses, followed by the Sphagnopsida, as concluded in the study of Liu et al. (2019) and suggested by others (Nickrent et al., 2000;Qiu et al., 2006;Shaw et al., 2010a;Fig. 1). Our analyses resolved Takakiopsida as a sister to all the other mosses, with 100% bootstrap support in all ML and ASTRAL-III analyses.
Within the species-rich class Bryopsida (true mosses), the subclasses Funariidae and Timmiidae (Timmia austriaca) were recognized as graded sister groups to the clade of Dicranidae plus Bryidae. These four subclasses together constituted a large proportion (>95%) of modern mosses (Newton et al., 2006) and demonstrated a relatively short ancestral branch length, indicative of a rapid species diversification process (Fig. 1). For expediency, we refer to them as the BDTF clade (Bryidae, Dicranidae, Timmiidae, and Funariidae). The other species-poor Bryopsida subclasses Diphysciidae (Diphyscium foliosum) and Buxbaumiidae (Buxbaumia aphylla) were resolved as laddered sister groups to the BDTF clade. The high-confidence supermatrix-based species tree topology generated here ( Fig. 1) was employed in subsequent phylogenomic analyses for the detection of ancestral gene duplication events.
3.3 Phylogenetic placement of ancestral gene duplications K S frequency plots have been used to detect signals for ancestral WGDs for mosses (Rensing et al., 2007;Szovenyi et al., 2015;Lang et al., 2018). Lang et al. (2018) detected K S peak signals in most Bryopsida and Sphagnopsida species and circumscribed two rounds of WGDs based on chromosomal synteny in the P. patens genome. However, detection of WGD events using paranome-based K S distributions is difficult (Vekemans et al., 2012), and the saturation effect of K S and rate heterogeneity among gene families and lineages can blur K S -based age peaks (Jiao et al., 2011). A large-scale gene tree-based phylogenomic method (Jiao et al., 2011) avoids such issues, and it was used successfully in WGD studies (Jiao et al., 2012;Devos et al., 2016;McKain et al., 2016;Unruh et al., 2018).
We examined gene trees constructed from the 5035 multicopy homolog groups using the PUG algorithm (McKain et al., 2016;Unruh et al., 2018), based on the constructed moss phylogeny, to detect six focal nodes with large numbers of gene duplications, as highlighted by the PUG algorithm (Fig. 2). The ancestral node with the largest number of gene duplications was identified at the ancestor of the BDTF clade (N20 in Fig. 2, 988 duplications BSV ≥ 80%). This supported a genome duplication event shared by all species in BDTF, designated as the ψ event or BDTF duplication. Large-scale duplication events were also indicated in the ancestor of the Funarioideae (N4 in Fig. 2, 794 duplications BSV ≥ 80%), which were not shared with Encalypta streptocarpa (Encalyptaceae) and in the ancestor of the Bryopsida, Polytrichopsida, and Tetraphidopsida (BPT) clade (N23 in Fig. 2, 265 duplications BSV ≥ 80%). The analyses captured the Sphagnopsida-wide genome duplication event (N1, 401 duplications BSV ≥ 80%) and the ancestral moss-wide duplication (N26, 357 duplications BSV ≥ 80%) reported earlier (Devos et al., 2016). Our analyses suggested that T. lepidozioides (Takakiopsida) was included in the moss-wide genome duplication event. The analyses using the BSV ≥ 50% were also labeled and highlighted at each internal node and yielded results consistent with BSV ≥ 80% (Fig. 2). It should be noted that we cannot exclude the possibility that the large numbers of gene duplications detected on N26 and N23 are derived from accumulations of individual gene duplication, as genomic structural evidence is not available. The large number of gene family duplications occurring before the rapid speciation in Hypnales was also captured; however, the number of detected gene duplications is low (N14, 149 duplications with BSV ≥ 80%) and does not strongly support a recent WGD event. A comprehensive analysis of gene family expansions in the Hypnales was reported in the study of Johnson et al., 2016; our work only corresponds to ancestral gene duplications (as a part of the expansions) that are supported by the generated gene family phylogenies.
3.4 Two WGDs in P. patens Phylogenetic timing of duplicated paralogs retained on genomic syntenic regions in the P. patens genome was summarized in the moss phylogeny ( Fig. 3 and Table 1). More than 95% of the syntelog pairs were phylogenetically identified as products of the N20 (BDTF) and N4 (Funarioideae) duplications. Using P. patens syntelogs as anchors, 93 and 114 duplication nodes (BSVs of 80% and 50%) were detected in the ancestor to the BDTF clade. Similarly, 287 and 339 duplications (BSVs of 80% and 50%) were detected in the ancestor of Funarioideae (shared by Physcomitrella and Funaria). All other ancestral nodes contained few (less than 1.5%) of the duplicated syntelogs (Table 1).
Syntelogs derived from N20 and N4 duplications were connected using CIRCOS v0.69-6 (Krzywinski et al., 2009;Figs. 3A, 3B) and compared to the P. patens chromosomal syntenic karyotype (Lang et al., 2018). The spatial distribution of the duplicated syntelogs provided strong evidence that the older WGD1 is congruent to the BDTF clade (N20) duplication, whereas the younger WGD2 is consistent with the Funarioideae The phylogenomic analyses provided robust evidence that the ancestral ψ genome duplication event occurred before diversification within the BDTF clade and the data are consistent with the reported WGD1 event in P. patens (Lang et al., 2018) and the WGD2 event in the ancestor of the Funarioideae.
3.5 Synonymous divergence of P. patens syntelogs from the Funarioideae and ψ duplications To evaluate the sequence divergence of the syntelogs after paleopolyploidy, we generated and analyzed the K S frequency distributions for the syntelog pairs that traced the Funarioideae and ψ duplication events in corresponding gene family phylogenies (Fig. 3). The observed K S frequency distribution peaks were consistent with the phylogenetic placement results, as described above, and the Funarioideae and ψ duplications are in parallel with major K S peaks around 0.6 (Figs. 3C, 3D) and 0.9 (Figs. 3E, 3F), respectively.
The synonymous divergence curves for syntelog pairs derived from both Funarioideae and ψ duplications were obviously right-skewed and the K S frequency distributions could be modeled with two major components containing one "younger" peak component and one "older" rightskewed component (blue and green for Funarioideae duplications in Figs. 3C, 3D, red and purple for ψ duplications in Figs. 3E, 3F). The analysis using the BSV threshold at 50% for each component was described by "color/mean/ proportion," in agreement with previous analyses (Jiao et al., 2011(Jiao et al., , 2012. The K S frequency distribution of syntelog pairs derived from the Funarioideae duplications was modeled with three components: blue/0.59/0.67, green/ 0.88/0.27, and black/1.36/0.06 (Fig. 3C), and the K S frequency of ψ event syntelogs contained two statistically significant components: red/0.9/0.775 and purple/1.5/0.225 (Fig. 3E). Thus, the green component (Figs. 3C, 3D) in the Funarioideae duplications could severely blur the major component of the ψ duplications, as shown in Fig. 3G. This synonymous distance distribution pattern would make it difficult to infer ancestral genome duplications directly from the K S distribution without the phylogenetic timing information. We also analyzed the K S frequency distribution for all syntelogs in Fig. 2. Detection of focal nodes with high concentrations of ancestral paralogous gene duplication events. Mapping results from querying paralogous pairs identified from gene-tree-species tree reconciliation using the Polyploidy Using the Genomes algorithm. The number of duplication nodes with bootstrap values no less than 80% and 50% were counted and labeled at corresponding ancestral nodes on the species tree. Colored branches highlighted the number of unique gene duplications as depicted in the key. P. patens; the four major K S components were recaptured (Fig. 3H) and were similar to that reported in the study of Lang et al., 2018. We described each component using "component no./mean/proportion": 1/0.56/0.21, 2/0.76/0.33, 3/1.09/0.41, and 4/1.8/0.05, and according to the K S distribution analyzed above. The youngest component (1/0.56/0.21) is largely in parallel to the Funarioideae duplication event. However, the other two major components (2/0.76/0.33 and 3/1.09/0.41) could be a mixture of both Funarioideae and ψ duplications.
3.6 Nature of the ψ duplication event To validate the "two-component" divergence pattern of the ψ genes, we extracted all paralogous pairs with a coalescence node at N20 (BSV ≥ 50%) tracing the ψ events in gene family trees. K S values were estimated for these "low background" sets of ψ duplicated paralogous pairs. In parallel with the age estimations (Fig. S7), K S frequency distributions for the BDTF clade species were modeled with two significant components (Fig. S7, A-R); for example, the two major components "red/ 0.9/0.51" and "purple/1.26/0.4" seen in the P. patens syntelog Fig. 3. Continued analysis (Figs. 3E, 3F) were identified in the K S distribution for the ψ duplicated genes inferred from gene trees (Fig. S7L). The inferred P. patens paralogous pairs with a coalescence node N4 (BSV ≥ 50%) tracing the Funarioideae duplication event, two major components, "blue/0.63/0.66" and "green/0.99/0.3," were identified (Fig. S7T), which was consistent with the syntelog-based analyses. Synonymous substitution rate frequency distributions of orthologs between P. patens and other BDTF mosses demonstrated slightly younger peaks around 0.54-0.86 (Fig. S8), further confirming that the ψ event occurred before the diversification of the BDTF clade. Heterogeneity in evolutionary rates between different gene families and K S saturation effects may partially explain the distinctive "two components" in the age distribution of the ψderived duplicates. Several studies suggested that the rightskewed K S distribution might be better modeled using a lognormal distribution (Cui et al., 2006;Zwaenepoel & Van de Peer, 2018), but log-transformed K S could not perfectly resolve the two recurrent WGDs in P. patens (Lang et al., 2018), because the two peaks are close to each and mixed together. The two components are not the result of two recurrent genome duplication events or a triplication event as that would have been detected in P. patens (Lang et al., 2018). It is possible that genome segments were duplicated before the tetraploidization of the two subgenomes that constitute the ψ polyploidy event; however, specific genomic segments that were significantly older than others were not observed.
Moreover, we cannot fully rule out the possibility that the older component of the ψ duplications might be the result of errors introduced in the large-scale phylogenomic procedure, especially when a number of transcriptomes were included. It is possible that some "older" paralogous pairs might be misidentified on a "younger" node when a sequence from the sister taxa was absent due to gene losses and the incompleteness of transcriptomes. For example, when the genes from Encalypta were not expressed in the transcriptome, the paralogs duplicated in the ancestor of P. patens, Funaria, and Encalypta would be placed in the ancestor of P. patens and Funaria. Nevertheless, the phylogenomic analyses, together with the convincing genome synteny structural evidence, probed the accurate phylogenetic position of the two WGD events.
It is possible that the ψ duplication was an allopolyploidy event resulting from dioecious female and male gametophyte genomes, as suggested by Rensing et al. (2007), where the divergencies existed before the WGD event. This then indicates that the dated nodal ages traced the divergency time of the separation of the two subgenomes and not the polyploidy event (Doyle & Egan, 2010). We favor this hypothesis, because the allopolyploidization of dioecious gametophytes could yield a monoecious plant that would enhance the dispersal of breeding populations (Rensing et al., 2007) and provide extra ecological advantages for polyploids.

Evolutionary rate variations among mosses
Variations in nucleotide substitutional rates in the evolution of angiosperm genomes is common and linked to distinctive life histories (Smith & Donoghue, 2008). The node of N4 corresponds to the MRCA of Funarioideae in the moss phylogeny. b The node of N20 corresponds to the MRCA of BDTF clade in the moss phylogeny. Fig. 3. Determining the phylogenetic position and age of the two polyploidy events in Physcomitrella patens by tracing the phylogenetic timing of genomic syntelogs. A, Exemplar maximum-likelihood gene tree from OrthoGroup_1244. Two duplication nodes (with BSV ≥ 80%) were identified in this gene family phylogeny and were highlighted with stars. The paralogous pairs Pp3c20_12330V3 and Pp3c23_1780V3 (connected by blue curve) were retained on syntenic regions in P. patens genome, supporting the Funarioideae (N4) duplication, whereas the paralogous pairs Pp3c20_12330V3 and Pp3c24_6950V3 (connected by red curve) were detected on P. patens genome syntenic regions, supporting the ψ (N20) duplication event. B, The spatial distribution for the N4 (connected by blue lines) and N20 (connected by red lines) duplicated syntelogs in the P. patens genome. The phylogenetic timing for the syntelog pairs in P. patens was inferred from gene family phylogenies (as summarized in Table 1). The Funarioideae (N4) and BDTF-wide (termed ψ event, N20) duplications were depicted as blue and red lines, respectively. C, K S distribution from 339 P. patens syntelog pairs supporting a Funarioideae duplication (BSV ≥ 50%) in the phylogenetic analysis. D, K S distribution of 114 P. patens syntelog pairs supporting a BDTF clade duplication (BSV ≥ 50%) in the phylogenetic analysis. E, F, K S distributions of P. patens syntelogs supporting Funarioideae (287 pairs) and BDTF clade (93 pairs) duplications in the phylogenetic analyses with BS ≥ 80%. G, A contrasting plot for the K S distributions of P. patens syntelogs supporting Funarioideae (blue) and BDTF clade (red) duplications (BSV ≥ 50%). H, K S distribution of all paralogous pairs identified from P. patens genome syntenic block analysis. BSV, bootstrap support value; BDTF, Bryidae, Dicranidae, Timmiidae, and Funariidae.
There were distinctive K S peaks for the shared ψ event, from 0.45 (T. austriaca) to 1.32 (Bryum argenteum; Fig. S7, A-R), suggesting that there are substantial evolutionary rate variations among the different lineages after the ψ event. Setting the ψ event as the ancestral calibration point, we compared the nucleotide substitutional rates among the different lineages using K S estimations of ψ paralogs (Fig. 4). There were no conspicuous lineage-wide rate shifts, but there were significant within-lineage differences in evolutionary rates (Fig. 4). The species within the Hypnales clade exhibited similar substitutional rates as the ψ event paralogs, but for B. argenteum, the rate was significantly elevated. Evolutionary rate elevations were also seen in the Dycranidae (Syntrichia caninervis) and the Funariideae (Funaria hygrometrica). Each of the three species, B. argenteum, S. caninervis, and F. hygrometrica, exhibited major K S peaks (Fig. S7) that exceeded 1.0 K S for the ψ event. In contrast, significantly slower nucleotide substitutional rates could be observed for T. austriaca, E. streptocarpa, and Scouleria aquatica, with major K S peaks (Fig. S7) smaller than 0.6. Additionally, duplication signals were also observed in all the K S frequency distributions for each of the whole paranome in BDTF moss species (Fig. S9).

Gene retention after ancient large-scale duplications
The four ancestral branches with large numbers of gene duplications documented here could have contributed to not only to gene family expansions but also functional innovations in the evolutionary past of mosses. The potential functional bias among the duplicated paralogs that "survived" the ancient duplications was evaluated using GO enrichment analyses for the homolog clusters that contained the ancient duplications. Recurrent gene duplications within a specific gene family were rare, as only 23 homolog clusters were detected to contain duplication signals for all four ancient duplication events (Fig. 5A), and in P. patens genome, we detected only seven such gene families (Fig. 5B). If paralogs generated from all four duplications were retained, they could have generated as many as 2 4 -fold gene member expansions; thus, gene losses would be common after the duplication events. The Venn diagram analyses confirmed that the majority of the homolog clusters (1516 out of 2180 homolog clusters, 69.5%) contained only one of the ancient duplication events. Homolog clusters that contained two (503 clusters, 23%) and three (138 clusters, 6.3%) duplication events were much reduced. Similar patterns were observed in the homolog clusters containing duplication events for P. patens paralogs; 1288 (81%), 248 (15.6%), and 45 (2.8%) homolog clusters that contained gene duplications were derived from single, two, and three duplication events, respectively. For gene duplications supported by both moss and P. patens paralogs, a conspicuously large proportion of the overlapping gene families survived both the ψ and the Funarioideae duplication events, indicating that duplication signals for these two younger WGD events in Funarioideae were persistent in gene family phylogenies, such as that depicted in Fig 3A. Duplicated genes originated from the ψ and Funarioideae events remain pervasive in the P. patens genome (Lang et al., 2018); however, the paralogous pairs that survived the older duplication events would be rarer. We observed paralogous pairs in several transcription factor families that survived the moss-wide and BPT duplications. The genes clustered in orthogroup_243 encode bZIP transcription factors and P. patens genes were found in both of the child clades that constituted a moss-wide duplication. The more recent duplications (the ψ and Funarioideae events) were observed in the gene family phylogeny (Fig. S10). Orthogroup_362 encodes the Nin-like transcription factors where moss-wide Fig. 4. Nucleotide synonymous substitutional rate variations among mosses. A violin plot contrasting K S distributions of the ψ and Funariideae duplicated paralogs in different mosses. All inferred paralogous pairs tracing the ancestral ψ and Funarioideae genome duplications (BSV ≥ 50%) were analyzed to calculate nucleotide substitutional rates among BDTF clade mosses. BSV, bootstrap support value; BDTF, Bryidae, Dicranidae, Timmiidae, and Funariidae and BPT duplication nodes were well-supported by the tree phylogeny (Fig. S11). Similarly, the WRKY (Fig. S12) and SBP (Fig.  S13) transcription factors in P. patens also survived the ancestral moss-wide duplication. The genes in orthogroup_462 encode homologs of vascular associated death 1, a regulator of cell death and defense vascular responses (Lorrain et al., 2004), and also retain duplicated genes derived from the moss-wide and BPT duplications (Fig. S14). In all cases (Figs. S10-S14), the more recent ψ and Funarioideae duplication events contributed extra family members for specific clades in the gene family phylogeny. The preservation of the paralogs derived from ancient duplication events and the recurrent acquisition of family members exert evolutionary and ecological significance for functional renovations and the species radiation process (Rensing, 2014).
To determine if genes of specific functional categories were preferentially retained after the ancestral large-scale duplication events, we performed a GO term enrichment analyses for the homolog clusters containing gene duplications (Figs. S15-S18). The retained gene duplications were identified by the duplicated paralog pairs that mapped to the four focal nodes (N26, N23, N20, and N4) in the ML gene trees. We observed both conserved and distinct GO slim plant functional categories associated with the four ancient duplication events (Figs. 5C and 6). Despite the conclusion that gene families containing all four recurrent duplications are rare, several GO categories such as "response to stress," "carbohydrate metabolic process," "anatomical structure morphogenesis," and "post-embryonic development" were significantly overrepresented for all four duplication events. The statistical FDR corrected P values were depicted using simplified color schemes: gray colors represented not significant GO terms, light blue for statistically overrepresented GO terms with FDR ≤ 0.05, and dark blue for terms with FDR ≤ 0.01. FDR, False Discovery Rate; GO, gene ontology Duplicated gene families involved in "cell communication" and "signal transduction" were only enriched after the latest three duplication events. Duplications in gene families participating in "cell differentiation" and "transporter activity" were especially enriched for the moss-wide duplication event, indicating that these fundamental activities might have been well established in the ancestor of all mosses. A number of GO terms such as "DNA-binding transcription factor activity" and "generation of precursor metabolites and energy" were enriched after the most recent duplication event shared by Physcomitrella and Funaria, which is consistent with the determination that the P. patens genomic duplication contributed a large number of metabolic genes to its genome (Rensing et al., 2007). The enrichment of genes related to transcriptional regulatory functions may further suggest a conserved gene retention pattern across land plants, as observed in peat mosses (Devos et al., 2016), ferns (Li et al., 2018), and angiosperms B, Overrepresented GO terms for the duplicated paralogs derived from the BPT-wide (Bryopsida, Polytrichopsida, and Tetraphidopsida, N23) duplications. C, Overrepresented GO terms for the ψ event (N20) in the ancestor of BDTF clade (Bryidae, Dicrsanidae, Timmiidae, and Funariidae, N20). D, Overrepresented GO terms for the Funarioideae duplications (N4) shared by Funaria and Physcomitrella. (Maere et al., 2005a). Gene families involved in "kinase activity," "cell communication," and "signal transduction" were statistically overrepresented, and as their products interact and function directly via interactions with other gene products (and/or form macromolecular complexes), they are also likely to be preferentially retained after gene duplications so as to sustain gene dosage balance.
The recurrent accumulation of stress-related genes after the large-scale duplication events may have contributed to the ability of mosses to adapt to different ecological environments (Fig. 6). Mosses are more prone to stressed conditions as mosses in micro-environments that vary greatly in water and nutrient availability. The accumulation of stressrelated genes is in parallel to the observation that desiccation tolerance or poikilohydry is common in mosses and very rare among angiosperms (Wood & Jenks, 2007). The duplication of genes involved in anatomical structure might relate to morphological features that occur in mosses that have adapted to distinct habitats, such as the awns of the desert moss S. caninervis, which is considered a water collection trait (Pan et al., 2016). The accumulation of stressrelated and transporter genes after duplication events was also observed in peat mosses, which flourish in hostile environments that are a challenge for both tracheophytes and mosses (Devos et al., 2016).
The functional analysis of the homolog clusters comes with a caveat. A large proportion of the data used in the analyses was obtained from moss transcriptomes and not all duplicated genes are necessarily sampled. This is especially true for the transcriptomes derived from the oneKP datasets where the sequencing depths are lower. The potentially missing data can limit the overall functional enrichment analyses, and dissimilar functional terms might be obtained when more comprehensive datasets are constructed in the future.
3.9 Implications of the ancestral large-scale duplications in mosses Paleopolyploidy events provide substantial raw genetic materials for evolution and drive species radiation (Schranz et al., 2012;Vanneste et al., 2014;Van de Peer et al., 2017); however, it was suggested that polyploidy could be an evolutionary "dead end" (Mayrose et al., 2011;Arrigo & Barker, 2012). The data we generated indicated that the ψ event may have provided an impetus for a major species radiation in the Bryopsida. This is strengthened by the observation that the early-diverging Bryopdisa lineages (Diphysciidae and Buxbaumiidae) that are outside of the ψ event are species-poor, and the BDTF clade containing more than 95% of current moss species (Newton et al., 2006) experienced the ψ event.
However, unlike the GAMMA event in the core eudicots where triplicated syntenic regions were found in the Vitis genome (Jaillon et al., 2007), the ψ event in the mosses only generated duplicated syntenic blocks (as seen in the P. patens genome). The ψ event in the P. patens lineage likely resulted in a duplication of the seven ancestral moss chromosomes into the current 13 (one chromosome loss) intermediate chromosomes (Lang et al., 2018).
Our phylogenomic analyses circumscribed the P. patens WGD2 as a Funarioideae duplication event that was shared by both P. patens and F. hygrometrica. However, some F. hygrometrica accessions have been reported to contain 14 or 56 chromosomes, and 51% of the accessions contained 28 chromosomes (Fritsch, 1991). The P. patens accession (Gransden 2004) has 27 chromosomes (Lang et al., 2018) and some isolates of P. patens have 14 or 28 chromosomes (Engel, 1968). This suggests that some species within the Funarioideae may have escaped the duplication event or that independent polyploidization events are common in the mosses. It is also possible that some BDTF species might have escaped the ψ event, which needs to be tested. This is especially true when the transcriptomic/genomic data become available for F. hygrometrica accessions with low (e.g., n = 7) chromosome numbers, even though low chromosome numbers alone could be unequivocal evidence for escaping the ancestral WGD; for example, the five Arabidopsis thaliana chromosomes contained three WGDs, whereas the ancestral eudicot genome was reconstructed as seven chromosomes (Salse et al., 2009).

Conclusions
A robust moss phylogeny was constructed using an extensive matrix of putative one-to-one orthogroups derived from transcriptomic data. Consistent evidence that supported the occurrence of a whole-genome duplication event (termed ψ) before the species radiation within the BDTF clade was generated. We circumscribed the WGD1 and WGD2 events in the ancestor of BDTF clade and Funarioideae, respectively. We identified another two branches with large numbers of gene duplications: one in the ancestor of all mosses and another before the separation of Polytrichopsida, Tetraphidopsida, and Bryopsida. Functional assessment of homolg clusters containing gene duplications offers a possible insight into the evolutionary forces that have impacted bryophyte diversification.

Supplementary Material
The following supplementary material is available online for this article at http://onlinelibrary.wiley.com/doi/10.1111/jse. 12683/suppinfo: Fig. S1. Phylogenomic workflow utilized in this study for species tree construction and identification of ancestral gene duplications in mosses. Fig. S2. Exemplary gene tree generated in FastTree before masking the single-taxa monophyletic and paraphyletic clades. Clades that were constituted with genes from a single species were highlighted in red and the two ancestral duplications in Funarioideae and BDTF clade were labeled with stars. Fig. S3. Exemplary gene tree generated in RAxML after masking the single-taxa monophyletic and paraphyletic clades. Clades constituted with genes from a single species in Fig. S2 were masked and one representative sequence retained using the Yang & Smith (2014) pipeline. The detection of the two ancestral duplications in Funarioideae and BDTF (labeled with stars) were not affected by this masking procedure. Fig. S4. Comparison of species trees constructed using supermatrix-based and coalescence-based approaches. Left panel depicts the maximum-likelihood nucleotide supermatrix tree topology from RAxML. The right panel depicts the coalescence-based species tree inferred from ASTRAL-III, generated from the 649 maximum-likelihood gene trees of each single-copy orthogroup. Nodal bootstrap support values indicated by asterisks are 100%. The supermatrixbased and coalescence-based approaches generated consistent species tree topology at the subclass level in mosses, the only variation was discovered in the clade of Hypnales. Fig. S5. The maximum-likelihood species tree generated using the 1st and 2nd codon positions of the concatenated alignment supermatrix. Branch lengths are scaled to substitutions per site. The MRCA node for BDTF clade was indicated, and the relatively short branch length might be congruent with the rapid speciation following the ψ WGD event. Fig. S6. The maximum-likelihood species tree generated from the concatenated protein alignment using RAxML under the PROTGAMMAAUTO model. Branch lengths are scaled to substitutions per site. The MRCA node for BDTF clade was indicated and highlighted. Fig. S7. K S distribution of all inferred duplicated paralogs in different species tracing the ψ and Funarioideae duplications. (A-S) Paralogous pairs with coalescence node at N20 (BSV ≥ 50) were identified in each BDTF clade species and corresponding K S distributions were analyzed. (S-T) K S distribution analyses of paralogous pairs with coalescence node at N4 (BSV ≥ 50) in Funaria hygrometrica and Physcomitrella patens. Fig. S8. Distribution of synonymous substitutions (K S ) in orthologous pairs between species of Bryidae, Dicranidae, Timmiidae and Physcomitrella patens. The modal K S values were calculated from Kernal Density Estimation function in R statistical environment and labeled in the orthologous K S distribution, providing supporting evidence for the relatively young age of the BDTF clade. Fig. S9. Paralogous K S frequency distribution of whole paranomes in the mosses. Nodal K S values were estimated and redundancies were eliminated according to Maere et al. (2005). The R package 'mclust' was employed to conduct the finite mixture model analyses and gaussian components with mean values in [0.3, 2] and proportions were depicted in the figures. Putative gaussian components corresponding to the ψ were highlighted in red fonts. Fig. S10. Exemplar maximum-likelihood gene family (OG-243) phylogeny contains Physcomitrella patens paralogs that survived ancestral duplication. Genes in orthogroup_243 encode the bZIP transcription factors. P. patens genes were highlighted with arrows and the moss-wide duplication node was labeled as brown star. More recent duplications (e.g. the ψ and Funarioideae duplications) supported by tree topology were labeled as green circles. Fig. S11. Exemplar maximum-likelihood gene family (OG-362) phylogeny contains Physcomitrella patens paralogs that survived ancestral duplications. Genes in orthogroup_362 encode the Nin-like transcription factors. P. patens genes were highlighted with arrows and the moss-wide duplication node was labeled as brown star and the BPT duplication was labeled as a yellow star. More recent duplications supported by tree topology were labeled as green circles. Fig. S12. Exemplar maximum-likelihood gene family (OG-504) phylogeny contains Physcomitrella patens paralogs that survived ancestral duplications. Genes in orthogroup_504 encode the WRKY transcription factors. P. patens genes were highlighted with arrows and the moss-wide duplication node was labeled as brown star. More recent duplications supported by tree topology were labeled as green circles. Fig. S13. Exemplar maximum-likelihood gene family (OG-1137) phylogeny contains Physcomitrella patens paralogs that survived ancestral duplications. Genes in orthogroup_1137 encode the SBP transcription factors. P. patens genes were highlighted with arrows and the moss-wide duplication node was labeled as brown star. More recent Funarioideae duplication supported by tree topology was highlighted with green circle. Fig. S14. An exemplar maximum likelihood gene tree from OrthoGroup_462 contains Physcomitrella patens paralogs that survived ancestral duplications. Genes in or-thogroup_462 encodes homologs of VAD1 (Vascular Associated Death1), a regulator of cell death and defense responses in vascular tissues. P. patens genes were highlighted with arrows and the moss-wide duplication node was labeled as brown star and the BPT duplication was labeled as a yellow star. More recent P. patens gene duplications supported by tree topology were labeled as green circles. Fig. S15. GO enrichment network depicting the overrepresented categories for the orthogroups containing the mosswide gene duplications. The node sizes were in proportion to the number of orthogroups in each of the GO categories and the adjusted p-values were depicted by the node color. Fig. S16. GO enrichment network depicting the overrepresented categories for the orthogroups containing gene duplications in the ancestor of BPT (Bryopsida, Polytrichopsida and Tetraphidopsida) clade. The node sizes werein proportion to the number of orthogroups in each of the GO categories and the adjusted p-values were depicted by the node color. Fig. S17. GO enrichment network depicting the overrepresented categories for the orthogroups containing gene duplications in the ancestor of BDTF (Bryidae, Dicranidae, Timmiidae and Funariidae) clade. The node sizes were in proportion to the number of orthogroups in each of the GO categories and the adjusted p-values were depicted by the node color. Fig. S18. GO enrichment network depicting the overrepresented categories for the orthogroups containing the Funarioideae gene duplications in the ancestor of Funaria and Physcomitrella. The node sizes were in proportion to the number of orthogroups in each of the GO categories and the adjusted p-values were depicted by the node color. Table S1. Summary of plant genomes and transcriptomes included in the phylogenomic analyses. Table S2. Species contributions to single-copy orthologous groups.