Evolutionary rate variation among genes involved in galactomannan biosynthesis in Coffea canephora

Abstract The endosperm cell walls of mature coffee seeds accumulate large amounts of mannan storage polysaccharides, which serve as nutrient reserve for embryo and contribute to beverage quality. Our study investigated the evolutionary patterns of key galactomannan (GM) biosynthesis genes using d N/d S ratio, synteny, and phylogenetic analysis and detected heterogeneity in rate of evolution among gene copies. Selection ratio index revealed evidence of positive selection in the branch editing gene Coffea canephora alpha (α) galactosidase (Cc‐alpha Gal) at Cc11_g15950 copy (ω = 1.12), whereas strong purifying selection on deleterious mutations was observed in the Coffea canephora uridine diphosphate (UDP)‐glucose 4′‐epimerase (Cc‐UG4E) and Coffea canephora mannose‐1P guanylytransferase (Cc‐MGT) genes controlling the crucial nucleotide carbon sugar building blocks flux in the pathway. Relatively low sequence diversity and strong syntenic linkages were detected in all GM pathway genes except in Cc‐alpha Gal, which suggests a correlation between selection pressure and nucleotide diversity or synteny analysis. In addition, phylogenetic analysis revealed independent evolution or expansion of GM pathway genes in different plant species, with no obvious inferable clustering patterns according to either gene family or congruent with evolutionary plants lineages tested due to high dynamic nature and specific biochemical cell wall modification requirements. Altogether, our study shows a significant high rate of evolutionary variation among GM pathway genes in the diploid C. canephora and demonstrates the inherent variation in evolution of gene copies and their potential role in understanding selection rates in a homogenously connected metabolic pathway.


| INTRODUC TI ON
Evolutionary studies have shown that plant genomes tend to have high rate of evolution as compared with animals due to polyploidization, local duplication events, and high retention rates of duplicate genes (Jeffrey, 2002). Gene duplication enables duplicated members to evolve under varied selective constraint (Panchy, Lehti-Shiu, & Shiu, 2016), leading to evolutionary rate variation among copies as copies tend to undergo different selection pressure due to their contribution to fitness (Kimura, 1977;Rausher, Lu, & Meyer, 2008). Homogenously connected metabolic pathways are increasingly being used to clarify selection and evolutionary rate dynamics because genes operate as connected components based on their position, flux control, and downstream effects within the pathway (Rausher, Miller, & Tiffin, 1999). Pathway architecture has been shown to affect protein evolutionary rates (Greenberg, Stockwell, & Clark, 2008), for example, increased d N for downstream enzymes have previously been reported in carotenoid, gibberellin, and anthocyanin pathway genes (Livingstone & Anderson, 2009;Rausher et al., 1999;Yang, Zhang, & Ge, 2009). Thus, metabolic networks create the potential for predicting differential evolutionary rates as a function of the network properties. For example, low evolutionary rates in upstream genes have been reported due to their flux control properties (Vitkup, Kharchenko, & Wagner, 2006). In addition to pathway properties, rapid evolutionary rate pattern has been reported in regulatory genes than in structural genes (Purugganan & Wessler, 1994;Tucker & Lundrigan 1993), and in some plant lineages as a result of environmental factors (Pierce & Crawford, 1997).
CWSPs of the mannan family are reserved in considerable amount as nutrient source for germinating embryo in endosperm of many albuminous monocot and dicot seed plants, such as date palm, guar, and coffee (Buckeridge, 2010). In addition, the mannan CWSPs enable seeds to retain living cellular endosperm at mature, confer seed firmness, and contribute to water buffering during drought (Buckeridge, 2010;Horbowicz & Obendorf, 1994). The galactomannans (GMs) structure consist of a linear mannose backbone chain with galactose side chains which vary in mannose: galactose ratio among species, for example, 1:1 and 20:1 in fenugreek and tobacco, respectively (Reid, Edwards, Dickson, Scott, & Gidley, 2003). The mannose/galactose ratio variation in legumes subfamilies has been shown to reflect systematics and evolutionary pattern in their adaptation to tropical and temperate environments. For example, tropical legumes exhibit high content of galactomannan as well as mannose/galactose ratio, which could be used as taxonomic F I G U R E 1 Simplified Galactomannan biosynthetic pathway in coffee seed markers (Buckeridge & Reid 1996;Polhill et al., ;Siol, Wright, & Barrett, 2010).
Coffee belongs to the Rubiaceae family, the genus Coffea with approximately 124 species. It is a tropical plant ( Figure 2) with Coffea arabica and Coffea canephora as the two widely cultivated species. The enormous coffee endosperm which constitutes approximately 99% of total seeds mass along with its ability to accumulate large amount of galactomannans makes it a suitable model plant for investigating metabolism as well as the evolution of galactomannan pathway genes. The structure of the diploid C. canephora genome suggests that it has not undergone whole-genome duplication (WGD) since triplication origin from core eudicots, and gene expansion is mainly attributed to local duplication (Denoeud et al., 2014). For example, tandem duplication in caffeine metabolic pathway genes has been implicated in convergent adaptive evolution in coffee. In this study, we investigated the evolutionary patterns of galactomannan biosynthesis genes in coffee using selection pressure, synteny, and phylogenetic analysis with an aim of demonstrating variation in evolution using multiple-copy analysis in C. canephora.

| Plant material, DNA extraction, amplification, and sequencing
A total of 23 C. canephora including 18 cultivated and five wild accessions used in this study are maintained at Spice and Beverage Research Institute of Chinese Academy of Tropical Agricultural Sciences (Hainan, China). Cultivated accessions were selected based on our previously published polymorphic data (Ogutu et al., 2016;Yan et al., 2019). Seeds were randomly collected at mature stage, immediately frozen in liquid nitrogen, and then stored at −80°C until use. Total DNA was isolated from twenty-three freshly frozen coffee seeds using our previously described modified protocol (Ogutu et al., 2016). PCR primers were designed for specific sequencing regions (Table S1)

| Sequence retrieval and alignment
Nonredundant NCBI blast search (Gish & States, 1993) was conducted to identify coding DNA sequences of C. canephora galactomannan gene copies using previously reported coding DNA sequences for Cc-alpha, Cc-Man-S, and Cc-alpha Gal as query sequences (Joët et al., 2014;Pré et al., 2008), except for Cc-MGT and Cc-UG4E whose query sequences were retrieved from Arabidopsis and rice, respectively (Conklin et al., 1999;Guevara, El-Kereamy, Yaish, Mei-Bi, & Rothstein, 2014). The partial Cc-UG4E and Cc-MGT sequences (Joët et al., 2014) were used to blast Arabidopsis and rice databases and the best hits results were selected and subsequently used to retrieve complete coding sequences from C. canephora genome after confirming their annotation.

| Sequence diversity and codon usage bias test
The average pairwise nucleotide sequence diversity index, π, was calculated using Jukes and Cantor correction method in DnaSP v5.0 (Librado & Rozas, 2009). Effective number of codon (ENC) was tested in DnaSP and confirmed in CodonW (Peden, 1999). ENC is a F I G U R E 2 Coffee plant with mature cherries codon usage bias test that can reflect the level of selective constraint on a particular gene and variation in synonymous substitution rate is related to codon usage (Sharp, 1991;Sharp, Tuohy, & Mosurski, 1986). Low ENC values represent greater codon bias, while high ENC represents low codon usage bias (Wright, 1990).

| Variation in d N , d S, and ω among genes
The synonymous (d S ) and nonsynonymous (d N ) nucleotide substitution rates were calculated using CodeML feature in the phylogenetic analysis by maximum likelihood, PAML v4.3 (Yang, 1997) as described by (Jeffares, Tomiczek, Sojo, & Reis, ). Briefly, coding DNA sequences were translated to protein sequences, and their orthology determined for tree construction and CodeML analysis in PALM. Overall, cumulative mutations across the whole gene region can effectively be used to estimate positive selection or selective pressure by calculating substitution ratio of nonsynonymous and synonymous mutations (d N /d S = ω). Under neutral selection, both nonsynonymous and synonymous are fixed at the same rate thus ω = 1. Purifying selection prevents fixation of deleterious mutation, thus ω < 1, due to excess synonymous over nonsynonymous substitutions. When nonsynonymous mutation offers selective advantage, is it fixed at a higher rate than synonymous mutation?
Thus, ω > 1 (Yang & Nielsen, 2000). Due to complex demographic history of domesticated plants, parameters allowing for deviations from the standard neutral model were used to best reflect specific demographic events (Wright & Gaut, 2005). Evidence of positive selection was determined by calculating selection at each codon for all genes using likelihood-ratio test (LRT) to compare M8 and M8a models (Swanson, Nielsen, & Yang, 2003), which allow for positive selection (beta, and ω > 1) and near neutral model M8a (ω = 1), respectively, for gene copies in PAML. Branch-site analysis with model A was used to detect independent ω values of positive selection, and Bayes Empirical Bayes analysis (BEB) was performed to identify sequences under selection. For statistical significance, bivariate correlation between sequence length and mutation rates d S , d N, and ω was tested to confirm reliability of variation in the data.

| Phylogenetic and synteny analysis of galactomannan pathway genes
All-against-all BlastP in different plant genome databases was conducted during the period of March 2018 to retrieve putative galactomannan pathway gene copies and their corresponding annotations using default parameters (Altschul, Madden, Schaffer, Zhang, & Zhang, 1997;Gu, Cavalcanti, Chen, Bouman, & Li, 2002). Homology was inferred for sequences with at least 40% identity and 70% alignable region lengths and subsequently used for phylogenetic analysis.
Conserved synteny analysis of short chromosomal segments that included regions of GM pathway genes was conducted to visualize syntenic linkages in the C. canephora genome using genome comparison visualizer in Easyfig software (Sullivan, Petty, & Beatson, 2011).
Alignment of chromosome segments that contained putative GM pathway gene copies and their flanking regions extending 100 kb in both forward and reverse was conducted in MAFFTv6.7 (Kazutaka & Daron, 2013), and two chromosome regions with best alignment hits were selected for synteny analysis.

| Analysis of sequence polymorphism and codon usage bias
A total of 18 GM pathway gene copies were identified in C. canephora (Cc) genome and used for sequence variation test and evolutionary rates analysis. Evolutionary rate at silent (d S ) and replacement (d N ) mutation, nucleotide sequence polymorphism, and ω was compared among multiple gene copies of the GM pathway (Table 1). Overall, the average nucleotide sequence diversity (π) ranged from 0.03 in Cc-MGT gene copies to 1.13 in Cc-alpha Gal genes. Polymorphic variation within each gene copies varied from 0.02 to 0.24 for Cc08_g06540 and Cc10_g07720 in Cc-ManS; from 0 to 0.18 for Cc07_g07210 and Cc07_g07220 in Cc-GMGT and from 0.08 to 0.74 for Cc11_g00330 and Cc11_g15950 in Cc-alpha Gal genes copies (Table 1). In contrast, a relatively low sequence polymorphism was observed in Cc-UG4E and Cc-MGT gene copies. Most codon usage bias was detected among Cc-UG4E and Cc-MGT gene copies based on lowest average ENC of 38.9 and 40.9, respectively, while Cc-alpha Gal gene with the highest average d N /d S showed least codon usage bias with a mean average ENC of 57.8 (Table 2).

| Comparing d N , d S, and ω among gene copies
The average d S and d N values are shown in Figure 3 (Maria, Joseph, & Ziheng, 2001).
In addition, increased nonsynonymous substitution was detected in Cc-Man-S gene at Cc10_g07720 (ω = 0.73) and in Cc-alpha Gal gene at Cc02_g05490 (ω = 0.71), Cc04_g14280 (ω = 0.54) and Cc11_g00330 (ω = 0.73). However, the LRT test was not significant for Cc02_g05490. Overall, high d N /d S ratio variation among and within gene copies was observed, which demonstrates a significantly diverse evolutionary rate in GM pathway genes. These results suggested different rates of selection events on GM pathway gene copies in C. canephora. More selection events appeared to have occurred in Cc-alpha Gal gene, while Cc-UG4E and Cc-MGT copies showed evidence of strong selective constraint. However, interpreting whether the observed difference in selection rates across the pathway is actually more than expected at random requires conjunction of this data with results from population genetics models that include recombination between selected sites and nearby neutral marker (Charlesworth, 2006;Schmid, Ramos-Onsins, Ringys-Beckstein, Weisshaar, & Mitchell-Olds, 2005).

| Synteny and phylogenetic analysis
Phylogenetic and synteny analysis was conducted to get insight into evolutionary history of GM pathway gene copies. Extensive synteny was observed for most sequence segments with strong conservation F I G U R E 3 Evolutionary rates at silent and replacement sites of the GM pathway genes. Average values of d S , d N, and d N /d S are represented by horizontal bars within the boxes at 95% confidence intervals of tandemly linked regions spanning ~60% in length for duplicate copies in Cc-Man-S, Cc-GMGT, Cc-UG4E, and Cc-MGT (Figure 4; Figure   S1). In contrast, a relatively weak synteny was detected at Cc-alpha Gal, which might be due to more selection events accompanying its duplication in C. canephora.
The phylogenetic analysis, however, failed to support nucleotide diversity in the dataset with paralogs in most species tested forming distinct subclusters. In addition, no obvious orthologous clustering congruent with species evolution could be inferred. Each gene family appeared to have unique expansion patterns or have undergone independent evolutionary events in different plant lineages analyzed in this study. In addition, analysis of GMGT and UG4E phylogenies indicated independent subclades that corresponded with monocots and dicots groups ( Figure 5; Figures S2 and S3).

| D ISCUSS I ON
Our study examined the evolutionary rate analysis of the galactomannan pathway genes based on multiple copies analysis in the diploid C. canephora this is because multiple-copy encoding enzymes can provide direct evidence of gene evolution patterns in metabolic pathways (Chu, Wang, Cheng, Yang, & Yu, 2014). Five wild and eighteen accessions representing independent, divergent species (Kryazhimskiy & Plotkin, 2008; were sequenced to analyze evolutionary patterns in 18 GM pathway gene copies. Differential selection rates on genes cause effect on patterns of nucleotide sequence diversity (Ma et al.,2015, Ramsay, Rieseberg, & Ritland, 2009. In this study, we detected a correlation between different selection pressure and patterns of sequence diversity, sugar building blocks, respectively. Thus, a strong negative selection is needed to optimize and remove deleterious mutations (Lu & Rausher, 2003) to ensure their novel evolutionary function remains fixed in the population. In addition, previous studies reveal that high gene expression is correlated to their fitness and significant role in survival (Blanc & Wolfe, 2004;Popescu, Borza, Bielawski, & Lee, 2006). Fitness genes exhibit strong codon bias due to intense purifying selection for translation efficiency and their expression levels are usually higher (Blanc & Wolfe, 2004;Brown & Kelly, 2018;Lavin, Herendeen, & Wojciechowski, 2005). More recently, expression of core GM pathway genes, including Cc-UG4E and Cc-MGT, was reported to be positively correlated with the amount of stored cell wall storage polysaccharides (Joët et al., 2014). Altogether, these reports are in agreement with the strong negative selection and greater codon bias observed for Cc-UG4E and Cc-MGT in our study.
Frequency of positive selection and selective constraints are most likely factors to account for evolutionary variation among gene copies in a metabolic pathway (Braverman, Hamilton, & Johnson, 2016;Gaut, Yang, Takuno, & Eguiarte, 2011;Yang et al., 2009). High ratio of synonymous (silent) to nonsynonymous (replacement) variation among and within gene copies observed in this study demonstrates a significant rate of evolutionary heterogeneity in GM pathway genes, which is consistent with high degree of endogenous heterogeneity reported in several cell wall storage polysaccharides (Gruppen, Kormelink, & Voragen, 1993;Toole et al., 2012). The d N /d S ratio and codon-based analysis detected signature of positive selection in Cc-alpha Gal gene copy.
Interestingly, different isoforms of α-Gal gene have been reported to exist in plant tissues (Chrost & Schmitz, 2000;Dey & Del Campillo 1984) which could contribute to high rate gene mutation (Lynch & Conery, 2000). In addition, enzyme position in a metabolic pathway has been linked to evolutionary rate variation with upstream genes evolving more slowly than the downstream genes (Clotault, Peltier, Soufflet-Freslon, Briard, & Geoffriau, 2012). This is because upstream genes have greater pleiotropic effects on a number of pathway end products (Cork & Purugganan, 2004;Pal, Papp, & Lercher, 2006;Rausher et al., 1999). The Cc-alpha Gal is a debranching enzyme catalyzing the postdepositional degree of galactose side branch to determine the mannose to galactose ratio, and thus, it is possible that its position in the GM biosynthesis pathway and occurrence in different isoforms could have played a role in its accelerated rate of accumulating adaptive mutations in C. canephora. No positive selection was observed in Cc-GMGT or Cc-Man-S, which together catalyze the polymerization of core galactomannan structure, despite the latter showing relatively high average d N /d S ratio in some copies, suggesting that it might be under relaxed selective constraint.
In contrast, phylogenetic analysis failed to support nucleotide diversity with most paralogs in species tested forming distinct subclusters. In addition, no obvious orthologous clustering congruent with species phylogeny could be inferred. Similar results have also been observed in other plant cell wall genes (Ahn et al., 2007;Figueiredo, Lashermes, & Aragão, 2011). This may be due to the highly variable and dynamic nature of plant cell walls throughout growth and development process (Campbell & Braam, 1999 (Yin, Huang, Gu, Bar-Peled, & Xu, 2011).
In summary, this study investigated the evolution patterns of five core structural genes of the galactomannan pathway using multi-copy based analysis and identified significant variation in evolutionary rate among genes. Positive selection detected in the GM branch editing gene Cc-alpha Gal is likely to explain the reason for varying mannose to galactose ratio in coffee seeds. Highly conserved Cc-UG4E and Cc-MGT suggest a strong negative selection, which purge changes that cause deleterious effects on the fitness due to their crucial role in sucrose carbon flux for galactomannan accumulation in coffee seeds. We also observed lineage-specific gene family expansion for MGT, alpha Gal, and Man-S, while independent expansion of GMGT and UG4E gene families in monocots and dicots. Our results provide insights into the evolutionary rate variations in endosperm cell wall storage polysaccharides genes in the diploid C. canephora.

ACK N OWLED G M ENTS
This project was supported by funds received from the Overseas Construction Plan for Science and Education Base, China-Africa Center for Research and Education, Chinese Academy of Sciences (grant no. SAJC201327).

CO N FLI C T O F I NTE R E S T
The authors declare that they have no competing interests. Lei Zhao, and Mr. Mohammad A. Belal helped with data analysis.

CO N S E NT FO R PU B LI C ATI O N
Not applicable.

E TH I C S A PPROVA L A N D CO N S E NT TO PA RTI CI PATE
All methods used to collect observational data were noninvasive.

DATA AVA I L A B I L I T Y S TAT E M E N T
All DNA sequences have been submitted to the GenBank databases, and accession numbers are provided in the methods section.