A genome-wide phylogenetic reconstruction of family 1 UDP-glycosyltransferases revealed the expansion of the family during the adaptation of plants to life on land


  • Lorenzo Caputi,

    1. Fondazione Edmund Mach, Centro Ricerca e Innovazione, Department of Food Quality and Nutrition, Istituto Agrario San Michele all’Adige–IASMA, Via E. Mach 1, 38010 San Michele all’Adige (TN), Italy
    • These authors equally contributed to this work.

  • Mickael Malnoy,

    1. Fondazione Edmund Mach, Centro Ricerca e Innovazione, Department of Genomics and Biology of Fruit Crop, Istituto Agrario San Michele all’Adige–IASMA, Via E. Mach 1, 38010 San Michele all’Adige (TN), Italy
    • These authors equally contributed to this work.

  • Vadim Goremykin,

    1. Fondazione Edmund Mach, Centro Ricerca e Innovazione, Department of Genomics and Biology of Fruit Crop, Istituto Agrario San Michele all’Adige–IASMA, Via E. Mach 1, 38010 San Michele all’Adige (TN), Italy
  • Svetlana Nikiforova,

    1. Fondazione Edmund Mach, Centro Ricerca e Innovazione, Department of Genomics and Biology of Fruit Crop, Istituto Agrario San Michele all’Adige–IASMA, Via E. Mach 1, 38010 San Michele all’Adige (TN), Italy
  • Stefan Martens

    Corresponding author
    1. Fondazione Edmund Mach, Centro Ricerca e Innovazione, Department of Food Quality and Nutrition, Istituto Agrario San Michele all’Adige–IASMA, Via E. Mach 1, 38010 San Michele all’Adige (TN), Italy

(fax +39 461 615200; e-mail stefan.martens@iasma.it).


For almost a decade, our knowledge on the organisation of the family 1 UDP-glycosyltransferases (UGTs) has been limited to the model plant A. thaliana. The availability of other plant genomes represents an opportunity to obtain a broader view of the family in terms of evolution and organisation. Family 1 UGTs are known to glycosylate several classes of plant secondary metabolites. A phylogeny reconstruction study was performed to get an insight into the evolution of this multigene family during the adaptation of plants to life on land. The organisation of the UGTs in the different organisms was also investigated. More than 1500 putative UGTs were identified in 12 fully sequenced and assembled plant genomes based on the highly conserved PSPG motif. Analyses by maximum likelihood (ML) method were performed to reconstruct the phylogenetic relationships existing between the sequences. The results of this study clearly show that the UGT family expanded during the transition from algae to vascular plants and that in higher plants the clustering of UGTs into phylogenetic groups appears to be conserved, although gene loss and gene gain events seem to have occurred in certain lineages. Interestingly, two new phylogenetic groups, named O and P, that are not present in A. thaliana were discovered.


Plants have developed the ability to produce an enormous number of secondary metabolites that are not required in the primary processes of growth and development, but are of vital importance for their interaction with the environment, for their reproductive strategy and for their defence mechanisms. This ‘chemodiversity’ is particularly developed in land plants, which have to face numerous challenges such as desiccation, temperature stress and ultraviolet (UV) radiation, and, in particular, in vascular plants that also need to maintain efficient transport of the metabolites, structural rigidity and a fine regulation of homeostasis. Glycosyltransferases (GTs) play pivotal roles in many of these important processes.

GTs are a ubiquitous group of enzymes that catalyse the transfer of a sugar moiety from an activated donor molecule onto saccharide or non-saccharide acceptors, resulting in the formation of polyglycosides, disaccharides and various glycosides of non-carbohydrate moieties, including small molecules of secondary metabolism. The biosynthesis of such molecules is of great biological importance because of the diverse functions that they carry out within the organism, and it involves the action of hundreds of different, more or less selective GTs (Coutinho et al., 2003). GTs that typically transfer sugars onto small molecules are grouped into family 1 of a classification scheme that currently includes 94 GT families (CAZy database; http://www.cazy.org; November 2011). This family comprises highly divergent enzymes from plants, animals, fungi, bacteria and also viruses. In plants, GTs utilise UDP-activated sugars as the major donor molecules and they contain a well-conserved UDP-glycosyltransferase (UGT)-defining motif close to the C-terminus as a unifying feature, which is one of the few regions of significant sequence similarity. The so-called PSPG motif (Plant Secondary Product Glycosyltransferase) comprises 44 amino acid residues and represents the nucleotide-diphosphate-sugar binding site of the enzymes (Gachon et al., 2005).

UGTs involved in plant secondary metabolism often display broad substrate specificity, at least in in vitro experiments, with recombinant proteins recognising a wide range of natural products as acceptor molecules. This promiscuity could also contribute to the immense skeletal variations of small molecules regarding their glycosylation pattern. However, UGTs can also be selective and the substrate specificities of some enzymes are defined by regiospecific or regioselective features of the aglycones (Lim et al., 2003a). In general, glycosylation can enhance the stability and water solubility of a molecule, thereby altering its movement within the cell, and it can change its biological activity (Bowles et al., 2006). UGTs have been discovered that glycosylate terpenoids (Caputi et al., 2008), benzoates (Lim et al., 2002), flavonoids (Lim et al., 2004), phenylpropanoids (Lanot et al., 2006), saponins (Shibuya et al., 2010) and plant hormones (Jackson et al., 2002; Poppenberger et al., 2005), just to mention a few examples. UGTs also play a pivotal role in the detoxification and deactivation of xenobiotics, such as 3,4-dichloroaniline (Brazier-Hicks et al., 2007a) and explosives (Gandia-Herrero et al., 2008), and in plant–pathogen interactions (Sullivan et al., 2001; Lorenc-Kukula et al., 2009).

To date, hundreds of UGT genes from model and non-model plants, including crops, ornamentals and medicinal plants, have been cloned and functionally characterized by in vitro and/or in vivo studies. However, considering the number of UGTs present in a plant genome, the number of functionally characterized proteins is still relatively low. Furthermore, a number of studies have indicated that drawing conclusions on the in vivo activity of UGTs can be difficult and that in vitro activity studies can sometimes be misleading. For instance, the A. thaliana enzyme UGT73C6 was characterized as a UDP-glucose:flavonol-3-O-glycoside-7-O-glucosyltransferase by using a T-DNA knock-out line that lacked quercetin-3-O-rhamnoside-7-O-glucoside and by in vitro assays with the recombinant protein (Jones et al., 2003). The authors found that the enzyme was also able to convert several flavonoid aglycones, such as kaempferol, quercetin, apigenin and genistein, to glycosylated products. However, flavones (apigenin) and isoflavones (genistein) are not naturally occurring metabolites in A. thaliana. Recently, Husar et al. (2011) reported UGT73C6 as an enzyme that glucosylates brassinosteroids in A. thaliana. This example highlights the basic problem that scientists face when studying the functions of these enzymes, which are selective in some cases and can display regio- and stereo-specificity, but in other cases can be so promiscuous that they recognise a range of substrates and produce multiple products. The release of more and more plant genomes now provides a massive amount of information on genes and the multigene families involved in secondary metabolite biosynthesis. This knowledge can be used for different purposes, such as understanding the molecular adaptations that occurred during plant evolution, elucidating new or poorly known metabolic pathways and identifying genes that could be used for biotechnological applications.

In this study, we performed a genome-wide comparative analysis of sequences that encode putative UGTs in the Plant Kingdom. These sequences were identified from the available fully sequenced and assembled genomes of the green alga Chlamydomonas reinhardtii, of relatively basal land plants such as Selaginella moellendorffii and Physcomitrella patens, and of various mono- and dicotyledonous plants including herbs, ornamentals, shrubs and trees (Fig. 1), by using the PSPG feature. This strategy turned out to be highly efficient, as candidate sequences could be identified with a high degree of confidence, despite limited sequence homology overall. We reconstructed the phylogeny of the large multigene family that included all the retrieved putative UGTs by employing a fast protein ML search. The topology obtained allowed us to gain new insights into evolutionary relationships among UGTs, to trace gene loss and gene gain events, to identify sequence fingerprints that defined the different phylogenetic groups and to conduct prediction of gene function.

Figure 1.

 Simplified tree of the Plant Kingdom.
Modified and condensed from: Pennisi (2011) and Clarke et al. (2011). The simplified tree shows the separation of Lycophytes from Euphyllophytes soon after plants colonized the terrestrial environment. Only the species used in this study are included to illustrate major developments in plant evolution. Numbers indicate the genome size.


UGTs expanded in land plant genomes

In the genomes of the single-celled green alga C. reinhardtii, the moss P. patens and the spikemoss S. moellendorffii, the three most ancient organisms included in our study (Fig. 1), we identified one, 12 and 74 putative UGTs, respectively (Table 1). When the number of genes found in P. patens, which belongs to the division Bryophyta (non-vascular plants), was compared with the number identified in S. moellendorffii, a lycophyte (possibly the oldest living lineage of vascular plants), a further sixfold increase in the number of putative UGTs was evident. Both the percentage of putative UGTs in the total number of genes and the number of UGTs per megabase of genome were higher in S. moellendorffii than in P. patens, indicating that the expansion was not necessarily related to an increase in genome size but could have been driven by the development of new functions.

Table 1.   UGT genes in the different sequenced genomes
Plant species | Predicted number of genes | Genome size (Mb) | Putative UGTs retrieved | Putative UGTs | % UGTs in total number of genes | UGTs per Mb | References
Arabidopsis thaliana | 27416 | 135 | 120 | 107 | 0.46 | 0.79 | TAIR10 (11/10)
Malus domestica | 57524 | 750 | 323 | 241 | 0.42 | 0.32 | Velasco et al. (2010)
Vitis vinifera | 33514 | 487 | 210 | 181 | 0.54 | 0.37 | Jaillon et al. (2007)
Populus trichocarpa | 45654 | 485 | 223 | 178 | 0.39 | 0.37 | Tuskan et al. (2006)
Glycine max | 46430 | 1115 | 231 | 182 | 0.39 | 0.16 | Schmutz et al. (2010)
Cucumis sativus | 26682 | 243 | 101 | 85 | 0.32 | 0.35 | Huang et al. (2009)
Mimulus guttatus | 28282 | 321.7 | 130 | 100 | 0.35 | 0.31 | Phytozome web site
Oryza sativa | 25012 | 389 | 200 | 180 | 0.72 | 0.46 | Tanaka et al. (2008)
Sorghum bicolor | 27640 | 730 | 191 | 180 | 0.65 | 0.25 | Paterson et al. (2009)
Selaginella moellendorffii | 22285 | 212.5 | 139 | 74 | 0.33 | 0.35 | Banks et al. (2011)
Physcomitrella patens | 38354 | 480 | 15 | 12 | 0.03 | 0.025 | Phytozome web site
Chlamydomonas reinhardtii | 17114 | 112 | 1 | 1 | 0.006 | 0.009 | Phytozome web site

In the two monocots considered in this study, S. bicolor and O. sativa, the number of putative UGTs identified was 180 in both plants, representing a 2.4-fold increase in the number of sequences when compared with S. moellendorffii (Table 1). In both species, a two-fold increase of the percentage of putative UGTs in the total number of genes was also observed. Nevertheless, the number of UGT sequences per megabase did not increase proportionally.

The number of putative UGTs found in the genomes of the selected dicots ranged from 85 in C. sativus to 241 in M. domestica. In V. vinifera, G. max and P. trichocarpa, we found about the same number of sequences, 181, 182 and 178, respectively (Table 1). The percentage of UGTs in the total number of genes was very similar among the dicots considered, ranging from 0.32 in C. sativus to 0.54 in V. vinifera. These values were slightly lower than those observed in monocots (0.65 in S. bicolor and 0.72 in O. sativa). The number of UGT genes per megabase ranged from 0.16 in G. max to 0.79 in A. thaliana, and the results were comparable with those obtained for the monocots. Within the analysed genomes, the highest number of putative UGTs was found in M. domestica. However, this expansion was associated with an increase in the genome size and in the total number of predicted genes.
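The two normalised measures reported in Table 1, and the fold increases quoted in the text, follow directly from the raw counts. The following is a minimal Python sketch that recomputes them for a few rows transcribed from Table 1; the variable and function names are illustrative and are not taken from the study.

```python
# Values transcribed from Table 1:
# (predicted number of genes, genome size in Mb, putative UGTs).
TABLE1 = {
    "Malus domestica": (57524, 750.0, 241),
    "Vitis vinifera": (33514, 487.0, 181),
    "Oryza sativa": (25012, 389.0, 180),
    "Selaginella moellendorffii": (22285, 212.5, 74),
    "Physcomitrella patens": (38354, 480.0, 12),
}


def ugt_density(genes, genome_mb, ugts):
    """Return (% UGTs among predicted genes, UGTs per Mb of genome)."""
    return 100.0 * ugts / genes, ugts / genome_mb


def fold_change(species_a, species_b):
    """Putative UGT count of species_a divided by that of species_b."""
    return TABLE1[species_a][2] / TABLE1[species_b][2]


if __name__ == "__main__":
    for name, row in TABLE1.items():
        pct, per_mb = ugt_density(*row)
        print(f"{name}: {pct:.2f}% of genes, {per_mb:.2f} UGTs/Mb")
    # Moss to spikemoss, then spikemoss to rice.
    print(fold_change("Selaginella moellendorffii", "Physcomitrella patens"))
    print(fold_change("Oryza sativa", "Selaginella moellendorffii"))
```

Normalising by both gene number and genome size is what allows the comparison made in the text: a species can hold many UGTs simply because its genome is large (M. domestica), or because the family takes up a larger share of its gene complement (O. sativa).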

Whilst mammalian UGTs are membrane-bound enzymes and can also contain an N-terminal signal sequence for segregation into the endoplasmic reticulum, plant enzyme sequences do not possess obvious targeting information (Li et al., 2001). However, three UGTs that contain both the PSPG motif and a hydrophobic sequence have been reported to date (Miller et al., 1999; Woo et al., 1999; Moraga et al., 2004). For this reason, we decided to analyse the UGT sequences using the TargetP program, which has high sensitivity and specificity for plant sequences, to identify proteins carrying a signal peptide for subcellular localization. The results indicated that none of the sequences contained such motifs, which supported the assumption that plant UGTs are cytosolic enzymes.

Phylogeny reconstruction of the plant UGTs multigene family

The putative UGTs identified in the plant genomes were used to build a phylogenetic tree that allowed the study of the molecular evolution and organisation of the multigene family. As the multigene family has been well characterized in A. thaliana (Li et al., 2001; Ross et al., 2001; Paquette et al., 2003), we included the 107 A. thaliana UGT sequences in our analyses to highlight the differences between the gene families of the other plants and that of A. thaliana in terms of gene loss and gene gain events. The alignment of 1520 sequences showed that UGTs in the different plant species are dissimilar over most of the sequence, to the degree that homology of the variable amino acid regions in these genes is not obvious. However, plant UGTs contain several conserved domains in their amino acid sequences that are apparently homologous (Li et al., 2001). We used these domains for phylogeny reconstruction. The putative UGT identified in the C. reinhardtii genome, although containing an unambiguous PSPG motif (Figure S1), displayed very low sequence similarity in the considered regions and was therefore excluded from the phylogeny reconstruction.

The maximum likelihood (ML) tree based on these data is shown in Figure 2(a). The gene identifications of the sequences used in the phylogenetic reconstruction are provided in Table S1.

Figure 2.

 Phylogenetic analysis of plant Family 1 UGTs.
(a) Reconstructed phylogenetic tree obtained from 1520 putative UGT sequences aligned using MUSCLE v3.8.31. Phylogenetic analyses by the ML method were performed with RAxML v7.0.4 as described in the Experimental Procedures section. The phylogenetic tree was visualized using Dendroscope v2.7.4.
(b) Distribution of the putative UGTs from different plants within groups E and L, showing the separation between UGTs from monocots and dicots. The groups were extracted from the reconstructed tree and visualized using Dendroscope v2.7.4.

In an earlier study, the analysis of the conserved amino acids encoded by the A. thaliana UGTs supported the presence of 14 distinct groups designated A to N (Ross et al., 2001). These groups were all recovered in our phylogenetic analysis with high bootstrap values (BP) for most respective clades. In addition, our analysis revealed two new groups, named O (supported by 92% BP) and P (supported by 96% BP), which did not contain any A. thaliana sequences. Surprisingly, the new groups contain sequences from all the other higher plants investigated (V. vinifera, G. max, S. bicolor, O. sativa, P. trichocarpa, C. sativus, M. guttatus and M. domestica), revealing that two massive gene loss events occurred in the A. thaliana lineage.

As expected on the basis of the evolutionary separation existing between mosses and vascular plants, the majority of the putative enzymes from P. patens formed an independent cluster separated not only from the sequences from the higher plants but also from those identified in the spikemoss S. moellendorffii. On the other hand, with the exception of two putative UGTs clustering in group A and two in group E, all the sequences retrieved from the genome of S. moellendorffii formed a large cluster next to the phylogenetic group F, which indicated a closer relationship with those from the vascular plants compared with the moss (Figure 2a).

The phylogeny reconstruction also showed that the sequences from the different plant species within each phylogenetic group generally have a tendency to cluster together, albeit in some cases they appear to be scattered. This phenomenon, which implies a higher sequence homology between closely related species, is particularly evident in the larger groups, such as E and L (shown as examples in Figure 2b), in which the occurrence of deep branches that are composed exclusively of putative enzymes from the monocots rice and sorghum makes separation of the enzymes from these plants from those identified in the dicots obvious.

Distribution and functions of the putative UGTs in the different phylogenetic groups

A detailed analysis of the reconstructed phylogenetic tree provided an insight into the distribution of the UGTs across the phylogenetic range covered (Table 2, Figure 3a). UGT sequences from P. patens formed an isolated cluster separated both from the putative enzymes of S. moellendorffii and from those of the higher plants, whilst the enzymes from S. moellendorffii clustered in three of the phylogenetic groups present in higher plants (A, E and F).

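Table 2.   Distribution of the UGTs in the different phylogenetic groups
Plant species | A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Isolated clusters
Arabidopsis thaliana | 14 | 3 | 3 | 13 | 22 | 3 | 6 | 19 | 1 | 2 | 2 | 17 | 1 | 1 | - | - | -
Malus domestica | 33 | 4 | 7 | 13 | 55 | 6 | 40 | 14 | 11 | 12 | 6 | 16 | 13 | 1 | 5 | 5 | -
Vitis vinifera | 23 | 3 | 4 | 8 | 46 | 5 | 15 | 7 | 14 | 4 | 2 | 31 | 5 | 1 | 2 | 11 | -
Populus trichocarpa | 12 | 2 | 6 | 14 | 49 | - | 42 | 5 | 5 | 6 | 2 | 23 | 6 | 1 | 3 | 2 | -
Glycine max | 25 | 3 | 1 | 43 | 36 | 1 | 15 | 3 | 18 | 3 | 2 | 19 | 4 | 1 | 5 | 3 | -
Cucumis sativus | 10 | 1 | 2 | 12 | 13 | - | 11 | 5 | - | 2 | 1 | 17 | 2 | 1 | 3 | 5 | -
Mimulus guttatus | 10 | 2 | 3 | 11 | 14 | - | 12 | 1 | 4 | 2 | 9 | 17 | 2 | 1 | 9 | 3 | -
Oryza sativa | 14 | 9 | 8 | 26 | 38 | - | 20 | 7 | 9 | 3 | 1 | 23 | 5 | 2 | 6 | 9 | -
Sorghum bicolor | 10 | 4 | 6 | 24 | 50 | - | 17 | 12 | 8 | 3 | 1 | 26 | 6 | 3 | 8 | 2 | -
Selaginella moellendorffii | 2 | - | - | - | 2 | 69 | - | - | - | - | - | - | - | - | - | - | 1
Physcomitrella patens | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 12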
Figure 3.

 Expansion of the plant Family 1 UGTs in the selected plant species.
(a) Percentage of putative UGTs in the different phylogenetic groups.
(b) Comparison of the fold increase of putative UGTs in each group with Arabidopsis thaliana as reference.

Five phylogenetic groups, A, D, E, G and L, seem to have expanded more than the others during the evolution of the higher plants, even though the number of genes found in each of these groups varied considerably between the different species. With some exceptions, phylogenetic group E has expanded more than any other group. In fact, the number of putative genes in this group accounts for 20–25% of the putative UGT genes present in the multigene family. These results differ slightly from those reported previously in A. thaliana (Li et al., 2001; Ross et al., 2001). Although A, D, E and L are major groups in A. thaliana, group G includes only six members in this species (about 6% of the total UGTs), whilst group H has expanded to become the second most abundant group (18%). In addition to the gene loss events that seem to have occurred in A. thaliana, which lacks groups O and P, two other gene loss or gene gain events could be suggested based on our analysis. Group F was found to be present in only four of the higher plants that we studied (V. vinifera, M. domestica, A. thaliana and G. max), and group I members could not be identified in C. sativus.

When the number of putative UGTs in each phylogenetic group was compared with that reported in A. thaliana, it was possible to notice that groups I, G and M were over-represented in all the other plants (Figure 3b). Group I has only one member in A. thaliana, whilst it contains 11, 14 and 18 putative UGTs in M. domestica, V. vinifera and G. max, respectively. The number of genes clustering in group G varied from six in A. thaliana to 40 in M. domestica and to 42 in P. trichocarpa, which constitutes a six- to seven-fold difference, and the number of genes in group M varied from one in A. thaliana to 13 in M. domestica. In M. domestica, the phylogenetic group J has expanded more than in other plants whilst in M. guttatus group K was found to contain five times more putative UGT genes when compared to A. thaliana and to other plants.
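The fold differences quoted above are simple ratios against the A. thaliana count for the same group. A minimal Python sketch, using only the counts stated in the preceding paragraph (the data structure and function name are illustrative assumptions):

```python
# Group sizes quoted in the text; A. thaliana is the reference species.
GROUP_COUNTS = {
    "G": {"A. thaliana": 6, "M. domestica": 40, "P. trichocarpa": 42},
    "I": {"A. thaliana": 1, "M. domestica": 11, "V. vinifera": 14, "G. max": 18},
    "M": {"A. thaliana": 1, "M. domestica": 13},
}


def fold_vs_reference(counts, reference="A. thaliana"):
    """Return {species: count / reference count} for one phylogenetic group."""
    ref = counts.get(reference, 0)
    if ref == 0:
        # Groups absent from the reference (O and P in A. thaliana)
        # have no defined fold change.
        raise ValueError("reference species has no members in this group")
    return {sp: n / ref for sp, n in counts.items() if sp != reference}


if __name__ == "__main__":
    for group, counts in GROUP_COUNTS.items():
        for species, fold in fold_vs_reference(counts).items():
            print(f"group {group}, {species}: {fold:.1f}-fold")
```

Because groups I and M each have a single member in A. thaliana, their fold values equal the raw counts in the other species, which is why these groups stand out in a comparison of this kind.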

In spite of the high number of putative UGTs that could be identified in the considered genomes, the number of UGTs characterized functionally is rather small and the majority of the information available comes from A. thaliana. With the exception of group A, several enzymes from the largest phylogenetic groups have been functionally characterized in A. thaliana. Two enzymes from phylogenetic group A in G. max, GmSGT2 and GmSGT3, were shown to be involved in the biosynthesis of soyasaponin I (Shibuya et al., 2010). Several enzymes from A. thaliana group D have been shown to recognise a range of substrates, including terpenoids (Caputi et al., 2008), flavonoids (Lim et al., 2004), benzoates (Lim et al., 2002) and brassinosteroids (Poppenberger et al., 2005), but no group D enzyme from the other plant species included in this study has been functionally characterized so far. The large group E in A. thaliana contains enzymes that possess similar activities to those from group D (using terpenoids, benzoates and flavonoids as substrates). It also contains UGT72B1, a bifunctional N- and O-glucosyltransferase involved in xenobiotic metabolism (Brazier-Hicks et al., 2007b). Also belonging to this group are the Malus enzymes UGT71A15, UGT71K1 and MdPGT1, which convert the dihydrochalcone phloretin to phloridzin (Jugdé et al., 2008; Gosch et al., 2010), and the G. max Gma88E3, a UDP-glucose:isoflavone-7-O-glucosyltransferase that is expressed in roots (Noguchi et al., 2007). Phylogenetic group L is unique in terms of enzyme activities. Members of this group in A. thaliana recognise carboxyl groups and catalyse the formation of glucose ester bonds on a variety of different metabolites such as phenylpropanoids (Lim et al., 2003b), benzoates (Lim et al., 2002) and the auxin indole-3-acetic acid (Jackson et al., 2001). The recognition of carboxyl groups, however, is not specific, as the same enzymes can also attach sugar moieties to hydroxyl groups.
In our study, we identified two putative UGTs from V. vinifera that share >94% identity with a bifunctional resveratrol/hydroxycinnamic acid glucosyltransferase isolated from V. labrusca (Hall and De Luca, 2007). In A. thaliana, group G enzymes were shown to glycosylate the primary hydroxyl groups of terpenoids (Caputi et al., 2008) and the N6 side chain of trans-zeatin and dihydrozeatin (Hou et al., 2004). The only group G enzyme functionally characterized from the other plants under analysis is UGT85B1 from S. bicolor, which catalyses the final glucosylation step in the biosynthesis of the cyanogenic glucoside dhurrin (Jones et al., 1999; Hansen et al., 2003). Although group H is the second major group of UGTs in A. thaliana, until now enzymatic activity has been described for only two of its members, UGT76C1 and UGT76C2, which were shown to N-glycosylate cytokinins at the N7 and N9 positions (Hou et al., 2004). Some proteins from phylogenetic group F have been characterized functionally in A. thaliana, V. vinifera and G. max. The A. thaliana UGT78D1 was classified as a UDP-rhamnose:flavonol-3-O-rhamnosyltransferase (Jones et al., 2003), whilst UGT78D2 was classified as a UDP-glucose:flavonol-3-O-glucosyltransferase (Tohge et al., 2005). In V. vinifera, VvGT1 was shown to glucosylate anthocyanidins (Ford et al., 1998; Offen et al., 2006), VvGT5 was characterized as a UDP-glucuronic acid:flavonol-3-O-glucuronosyltransferase and VvGT6 as a bifunctional UDP-glucose/UDP-galactose:flavonol-3-O-glucosyltransferase/galactosyltransferase (Ono et al., 2010). The only enzyme that constitutes group F in G. max is UGT78K1, which was found to possess flavonol-3-O-glucosyltransferase activity (Kovinich et al., 2010). In A. thaliana group B, UGT89B1 was shown to recognise some flavonoids and benzoates, whilst UGT89C1 was shown to convert kaempferol 3-O-glucoside to kaempferol 3-O-glucoside-7-O-rhamnoside and to recognise 3-O-glycosylated flavonols and UDP-rhamnose as substrates, but not flavonol aglycones, 3-O-glycosylated anthocyanins or other UDP-sugars. These results indicated that UGT89C1 is a flavonol 7-O-rhamnosyltransferase (Yonekura-Sakakibara et al., 2007). None of the putative group B enzymes identified in the other plants in this study has been characterized functionally. Nothing is known so far about the function of the other phylogenetic groups (C, I, J, K, M, N), either in vitro or in vivo.

Analysis of the PSPG motif

The availability of a large set of putative UGT genes from different plant species provided the opportunity to revise the knowledge available on the PSPG consensus sequence and to investigate whether specific amino acid residues within the PSPG motif are characteristic of particular phylogenetic groups. We therefore restricted the analysis to the 44 amino acid positions of the motif and generated sequence logos for each phylogenetic group to obtain a graphical representation of the sequence conservation and the relative frequency of each amino acid at each position (Crooks et al., 2004). All the logos are provided in Figure S2. In Figure 4, the PSPG motifs of groups E, L and O are shown as examples. The stars indicate the residues that interact directly with the UDP-sugar based on the crystal structures of MtUGT71G1 (group E) (Shao et al., 2005), VvGT1 (group F) (Offen et al., 2006), AtUGT72B1 (group E) (Brazier-Hicks et al., 2007b) and MtUGT85H2 (group G) (Li et al., 2007). Our analysis confirmed that some amino acids are very highly conserved, whilst, for others, a certain degree of variation exists. In particular, the highly conserved amino acids were found in positions 1 (W), 4 (Q), 8 (L), 10 (H), 19–24 (HCGWNS), 27 (E), 39 (P), 43 (E/D) and 44 (Q). Some of them participate in the interaction with the invariant part of the sugar donor, whilst residues W (22), D/E (43) and Q (44) are positioned to form hydrogen bonds to the sugar part of the donor (Osmani et al., 2009). The residue Q (44) was found to be very highly conserved across the phylogenetic groups with the exception of groups F and N, in which it is replaced by asparagine (N) or histidine (H), and by histidine, respectively. The amino acid residue at position 44 has been found to be a critical determinant of the sugar donor specificity of UGTs. Glutamine (Q) in this position is important for the maximal catalytic efficiency of glucosyl transfer activity, whereas histidine is required for galactosyl transfer activity (Kubo et al., 2004; Ono et al., 2010). Whilst the presence of a glutamine residue at position 44 cannot be considered predictive of glucosyltransferase activity, as this residue can also be found in glucuronosyltransferases, such as the UDP-glucuronic acid:anthocyanin glucuronosyltransferase UGT94B1 from Bellis perennis (Sawada et al., 2005; Osmani et al., 2008), and in bifunctional enzymes (Ono et al., 2010), the analysis of the PSPG motif of other functionally characterized plant galactosyltransferases present in the CAZy database (http://www.cazy.org) showed that they all contain a histidine residue at this position (Figure S3). The effect of the substitution of glutamine 44 by asparagine (N), which occurred in some sequences that belong to group F, remains to be investigated. However, it is worth mentioning that this substitution is present in the A. thaliana UGT78D1 (group F) (Jones et al., 2003), characterized functionally as a rhamnosyltransferase. Nonetheless, the role of other residues, located outside the PSPG motif, in enhancing the recognition of specific sugar donors has been reported (Osmani et al., 2008; Ono et al., 2010).
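The highly conserved positions listed above can be written as a 44-residue pattern with wildcards at the variable positions. The Python sketch below is an illustration only: the study does not state that a regular expression was used, and the function names are hypothetical. Such a strict pattern would also miss the variants described in the text, for example sequences from groups F and N that carry asparagine or histidine at position 44.

```python
import re

# Highly conserved positions reported in the text
# (1-based positions within the 44-residue PSPG motif).
CONSERVED = {
    1: "W", 4: "Q", 8: "L", 10: "H",
    19: "H", 20: "C", 21: "G", 22: "W", 23: "N", 24: "S",
    27: "E", 39: "P", 43: "[ED]", 44: "Q",
}
MOTIF_LENGTH = 44


def build_pspg_pattern():
    """Compile a regex with '.' at every position not listed as conserved."""
    parts = [CONSERVED.get(pos, ".") for pos in range(1, MOTIF_LENGTH + 1)]
    return re.compile("".join(parts))


def find_pspg_candidates(sequence, pattern=None):
    """Return (0-based start, matched 44-mer) for every hit, overlaps included."""
    pattern = pattern or build_pspg_pattern()
    # A lookahead with a capture group reports overlapping matches as well.
    scanner = re.compile(f"(?=({pattern.pattern}))")
    return [(m.start(), m.group(1)) for m in scanner.finditer(sequence)]
```

In practice a tolerant search (a profile or a similarity search seeded with the motif) would be needed to recover divergent family members; the strict pattern is only meant to make the positional constraints explicit.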

Figure 4.

 Web logos representing the PSPG motif of phylogenetic groups E, L and O. The stars indicate the residues interacting directly with the UDP-sugar based on available crystal structures. Logos were generated using the WebLogo application available at http://weblogo.berkeley.edu (Crooks et al., 2004).

Although a low degree of conservation was expected for the residues of the PSPG motif that interact with the variable sugar moiety, which should vary depending on the type of sugar, we found that relatively high variability also exists for two residues that interact with the UDP moiety, namely the second and the third amino acids of the motif. Whilst in the majority of the phylogenetic groups the second residue is most frequently an alanine (A) or a valine (V), in groups G, J, K and L this position is generally occupied by a cysteine (C) or, only in group L, by a serine (S). In the UGTs that have been characterized structurally, a proline (P) residue that interacts with the invariant part of the sugar donor is found in position 3 of the motif. In our analysis, we found that this residue is highly variable in proteins clustering in phylogenetic groups A, C, G, J, K and L.
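The conservation displayed in a sequence logo can be approximated per position as the information content R = log2(20) - H, where H is the Shannon entropy of the observed residue frequencies. The Python sketch below computes this for a set of aligned motifs; it is an illustration with hypothetical function names, and it omits the small-sample correction that logo software may apply.

```python
import math
from collections import Counter


def column_information(aligned_motifs, alphabet_size=20):
    """Return one (information in bits, {residue: frequency}) pair per column.

    aligned_motifs: list of equal-length strings (e.g. 44-residue PSPG motifs).
    """
    length = len(aligned_motifs[0])
    if any(len(m) != length for m in aligned_motifs):
        raise ValueError("all motifs must have the same length")
    result = []
    for i in range(length):
        counts = Counter(m[i] for m in aligned_motifs)
        total = sum(counts.values())
        freqs = {aa: n / total for aa, n in counts.items()}
        entropy = -sum(p * math.log2(p) for p in freqs.values())
        # Fully conserved columns reach the maximum of log2(alphabet_size).
        result.append((math.log2(alphabet_size) - entropy, freqs))
    return result
```

A fully conserved position, such as W at position 1, reaches the maximum of log2(20), about 4.32 bits, whereas positions such as the second and third residues discussed above would score lower in the groups where several residues alternate.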

An interesting finding concerns the PSPG motif of the newly discovered group O. In proteins that cluster in this group two highly conserved residues were found at positions 41 and 42, histidine and serine, that are not present in any other phylogenetic group and therefore represent a fingerprint for proteins that cluster in this group.

Mapping of Group D putative UGTs

To gain an insight into the evolutionary events that probably occurred during the development of the phylogenetic groups, the chromosomal location of the paralogous UGT genes that constitute group D was investigated in different plants. This group represents a good case study as it is conserved in higher plants and it has expanded at different rates across different lineages. In particular, this group underwent a significant increase in size in G. max when compared with the other examined dicots.

Information on gene localization, at the time of writing, was not yet available in the Phytozome database for all the species considered in this study. Therefore, the analysis included only A. thaliana, V. vinifera, M. domestica, O. sativa, S. bicolor and G. max (Figure 5). The number of chromosomes that contained group D UGTs varied significantly between species. In V. vinifera they were found on three chromosomes, as in A. thaliana (n = 5), in spite of the larger number of chromosomes present in V. vinifera (n = 19). In G. max, in which group D expanded significantly more than in the other species, the sequences were scattered over 12 of the 20 chromosomes.

Figure 5.

 Localization of the phylogenetic group D putative UGTs on the chromosomes of Arabidopsis thaliana, V. vinifera, M. domestica, O. sativa, S. bicolor and G. max. Each UGT is represented by an arrow that indicates the orientation of the open reading frame (up, antisense; down, sense).

Group D putative genes were found either as isolated genes or organised into clusters. Clusters varied significantly in size and in the orientation of the open reading frames. Small clusters comprising two or three putative UGTs in the same orientation seem to occur more frequently than large clusters in some species, whilst large clusters that included four to 10 open reading frames were found in A. thaliana, O. sativa and G. max. In some of the clusters, UGTs with reading frames in opposite orientations were also found. The observation of tandemly duplicated genes linked on a chromosome indicates that they probably derive from gene duplication events driven by unequal crossover (Zhang, 2003).
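The distinction between isolated genes and clusters can be made operational by sorting the genes on each chromosome and grouping neighbours that lie within a chosen distance. The Python sketch below is one possible way to do this. The 50 kb threshold, the gene identifiers and the coordinates are all hypothetical, as the text does not state a distance criterion for defining a cluster.

```python
from collections import defaultdict

# Hypothetical example records: (gene id, chromosome, start in bp, strand).
EXAMPLE_GENES = [
    ("ugtA", "chr1", 10_000, "+"),
    ("ugtB", "chr1", 14_000, "+"),
    ("ugtC", "chr1", 19_000, "-"),
    ("ugtD", "chr1", 900_000, "+"),
    ("ugtE", "chr2", 5_000, "-"),
]


def tandem_clusters(genes, max_gap_bp=50_000):
    """Group genes on the same chromosome whose start lies within
    max_gap_bp of the previous gene's start (assumed threshold).

    Returns a list of clusters; a cluster of size 1 is an isolated gene.
    """
    by_chrom = defaultdict(list)
    for gene in genes:
        by_chrom[gene[1]].append(gene)

    clusters = []
    for chrom in sorted(by_chrom):
        ordered = sorted(by_chrom[chrom], key=lambda g: g[2])
        current = [ordered[0]]
        for gene in ordered[1:]:
            if gene[2] - current[-1][2] <= max_gap_bp:
                current.append(gene)
            else:
                clusters.append(current)
                current = [gene]
        clusters.append(current)
    return clusters


def same_orientation(cluster):
    """True if every open reading frame in the cluster is on the same strand."""
    return len({gene[3] for gene in cluster}) == 1
```

With the example records, the first three genes on chr1 form one cluster with mixed orientation, and the remaining two genes are reported as isolated.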

Analysis of sterol/lipid putative UGTs

By searching the genomes using the 44-amino acid signature motif, we excluded the UGTs involved in the glycosylation of lipids and sterols, such as the A. thaliana UGT80A2 and UGT80B1. These enzymes are very divergent from the other plant UGTs and seem to be related more closely to non-plant UGT families. In fact, they do not contain an obvious PSPG motif, as only eight amino acid residues match the consensus motif sequence of A. thaliana UGTs reported by Paquette et al. (2003). However, we considered that an investigation of this class of enzymes in the genomes of the plants selected for this study should be made for completeness. The number of putative sterol/lipid UGTs that we identified was very similar in the different plant species, between two and four in the higher plants, which indicated that radiation has not occurred for these enzymes (Table S2). Alignment of the sequences showed that their primary structure is highly conserved across species, although some degree of divergence is present within the N-terminal domain (Figures S4 and S5). Phylogenetic analysis based on the conserved region (430 amino acid residues) showed that all the enzymes cluster into two large clades, UGT80 and UGT81 (Figure S6).


In this study, we focused on the family 1 glycosyltransferases responsible for the glycosylation of plant secondary metabolites and in particular on those that contain the PSPG motif. To obtain a broad coverage of the putative UGTs that occur in plants, we carefully selected 12 plant species out of the available complete and mostly complete genome projects to cover different stages of land plant evolution. In particular, we analysed an alga, a bryophyte, a lycophyte, two monocots and seven dicots, which included representatives of Rosids, Vitales and Asterids.

Our genome-wide search showed that the UGT family has expanded in land plants, especially in tracheophytes. Bryophytes are separated from flowering plants by more than 300 million years (Clarke et al., 2011) and represent, perhaps, one of the first transition stages in the colonization of land by plants. The higher number of putative UGTs in the bryophyte P. patens compared with the green alga indicated clearly that the adaptation of plants to life on land was associated with a significant expansion of this multigene family. During the transition from Lycophyta to angiosperms, the expansion of this multigene family has continued at various rates among different lineages of vascular plants, in agreement with what has been reported by other authors (Yonekura-Sakakibara and Hanada, 2011). The highest number of putative UGTs was found in M. domestica, but it is possible to speculate that this is mainly the result of the relatively recent genome-wide duplication event that occurred about 50 million years ago in the tribe Pyreae (Velasco et al., 2010).

Mapping of the putative UGT genes from the phylogenetic group D, which is conserved across species and underwent rampant expansion in G. max, on the chromosomes of different plant species revealed that the expansion of the subfamily was probably driven by gene duplication through unequal crossover, rather than by chromosomal (or genome) duplication, resulting in the formation of gene clusters on the chromosomes. In general, the outcomes of gene duplication are paralogous genes that retain the same function in a process referred to as concerted evolution. In time, paralogous copies might acquire better or novel functions. This process provides the plant with the plasticity necessary for adaptation to a changing environment. Radiation of the UGT subfamilies probably reflects the physiological challenges that plants had to overcome for survival on land. Plants needed to interact with a new environment, to face various abiotic and biotic stress factors, to develop strategies to overcome pests and insect attacks, and to ensure their own propagation and spread in a highly competitive environment. During evolution, plants developed a range of novel biological processes, but only a relatively small number of ‘basic’ metabolic pathways of secondary products. It is well known that they produce a huge diversity of natural products by multiple decoration of a common skeleton (Gachon et al., 2005). Thus, diversity can only be achieved by concerted evolution and diversification of genes and gene products with altered functionality in regard to modification of secondary products. UGTs play an important role in the diversification of metabolites. The development of such a large ‘glycosylation toolbox’ appears to be an early acquisition of plants, which in Lycophyta had already reached a complexity comparable with that of higher plants. A similar evolutionary trend has been observed for other multigene families involved in the biosynthesis of secondary metabolites, such as cytochrome P450-dependent monooxygenases, acyltransferases and terpene synthases (Banks et al., 2011). In contrast, the sterol/lipid UGT80 and UGT81 families, which are involved in more conserved reactions, did not undergo radiation during evolution. A. thaliana lines that carry mutations in the UGT80A2 and UGT80B1 genes displayed an array of phenotypes that were pronounced in embryo and seeds, including a transparent testa phenotype and a reduction in seed size (DeBolt et al., 2009). The UGT81 clade is functionally annotated as a UDP-galactose:1,2-diacylglycerol 3-β-d-galactosyltransferase and catalyses the transfer of galactose from UDP-galactose to 1,2-diacylglycerol, forming monogalactosyldiacylglycerol, which plays a major role in determining the physicochemical characteristics of thylakoid membranes in the chloroplast (Shimojima et al., 1997).

The clustering of UGTs into phylogenetic groups appears to be conserved across all the vascular plants, although gene loss and gene gain events seem to have occurred in some lineages. Previous studies, in which UGT sequences from other plant species were combined with the A. thaliana multigene family for phylogenetic analyses, showed that most of these enzymes fall within the defined 14 groups (Vogt et al., 1997; Bowles et al., 2005; Gachon et al., 2005). However, the Zea mays and Phaseolus vulgaris cytokinin UGTs, indicated as cis-ZOG1, cis-ZOG2, ZOG1, ZOG2 and ZOX (Martin et al., 1999; Martin et al., 2001; Veach et al., 2003), were shown to form a unique branch on the phylogenetic tree that contained 107 A. thaliana UGTs (Hou et al., 2004; Bowles et al., 2005). These enzymes are responsible for the O-glucosylation of the hydroxylated isoprenoid side chain at the N6-position of the cytokinins, plant hormones that regulate many developmental events. Our phylogenetic reconstruction of the multigene family revealed the existence of two phylogenetic groups, named O and P, which are not present in A. thaliana, suggesting that they were lost at some stage during the evolution of this plant. This might be due to the massive genome reduction that occurred in this species. A combined phylogenetic analysis, which included the 107 A. thaliana UGTs, the sequences constituting the groups O and P and the Z. mays and P. vulgaris cytokinin UGTs, showed that these enzymes clustered in the phylogenetic group O (an alignment that includes some group O UGTs from different plants and the ZOG enzymes is shown in Figure 6). These results are in contrast with those reported by Yonekura-Sakakibara and Hanada (2011), who found that the ZOG cytokinin glycosyltransferases from P. vulgaris and Z. mays clustered with the A. thaliana UGT79, UGT91 and UGT92 families and postulated that these enzymes may act as cytokinin glycosyltransferases in planta, because their activity in vitro could not be demonstrated. Our analysis showed that UGT79 and UGT91 enzymes cluster together forming group A, whilst UGT92 sequences constitute the phylogenetic group M, in agreement with Ross et al. (2001). Further evidence that supports the clustering of the ZOG enzymes into group O came from the detailed analysis of their PSPG motif. Interestingly, we identified two residues, H(41) and S(42), that are highly conserved and are specific to this phylogenetic group (Figure 6). Whether these residues represent only a fingerprint of the group O UGTs or play a role in the enzyme activity will be investigated in future studies. For the newly discovered group P, we have not yet been able to identify any functionally characterized enzyme that belongs to this group, but this is the subject of ongoing research.

Figure 6.

Amino acid sequence alignment of group O and ZOG UGTs. The sequences were aligned using Muscle v3.8.31 software and visualized using Jalview (Waterhouse et al., 2009).

Although no conserved amino acid residues have been identified as general determinants of sugar specificity (Ouzzine et al., 2002; Kubo et al., 2004; Modolo et al., 2007; Yonekura-Sakakibara et al., 2007), the structural investigation of plant UGTs revealed the role of specific amino acid residues that are highly conserved in the PSPG motif. However, the less conserved amino acids within the motif have been shown to be no less important in determining the characteristics that are unique to particular enzymes, such as substrate recognition and catalytic potential, to the extent that a single non-conserved amino acid mutation in curcumin glucosyltransferase CaUGT2 (group D) critically affects the enzyme activity (Masada et al., 2007). We found that a certain degree of variation exists in specific positions inside this highly conserved motif. The occurrence of specific amino acids in those positions is a feature of one or a few phylogenetic groups, and provides evolutionary and/or functional information that could be helpful for enzyme discovery.
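The idea of residues that act as a fingerprint of one phylogenetic group, such as H(41) and S(42) in group O, can be sketched as a simple per-position comparison of aligned motifs. This is only an illustration and not the procedure used in the study: the conservation threshold, the function names and the six-residue toy sequences are invented, and the sequences are not real PSPG motifs.

```python
from collections import Counter

def conserved_residue(motifs, position, threshold=0.9):
    """Return the residue found in at least `threshold` of the motifs
    at a 1-based position, or None if no residue is that frequent."""
    counts = Counter(m[position - 1] for m in motifs)
    residue, n = counts.most_common(1)[0]
    return residue if n / len(motifs) >= threshold else None

def group_specific_positions(groups, threshold=0.9):
    """List (group, position, residue) triples where the residue is conserved
    in one group and never observed at that position in any other group."""
    length = len(next(iter(groups.values()))[0])
    found = []
    for pos in range(1, length + 1):
        for name, motifs in groups.items():
            residue = conserved_residue(motifs, pos, threshold)
            if residue is None:
                continue
            elsewhere = {
                m[pos - 1]
                for other, ms in groups.items() if other != name
                for m in ms
            }
            if residue not in elsewhere:
                found.append((name, pos, residue))
    return found

# Toy, equal-length "motifs" per group (hypothetical, for illustration only).
groups = {
    "O": ["WAPQHS", "WAPQHS", "WGPQHS"],
    "D": ["WAPQLE", "WAPQVE", "WAPQLE"],
}
specific = group_specific_positions(groups)
```

In this toy set, positions 5 and 6 carry residues that are conserved in group O and absent from group D, which mirrors the kind of signal described for the group O motif.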

Experimental Procedures

Identification of putative UGTs in the selected plant genomes

The genomes of Vitis vinifera, Glycine max, Sorghum bicolor, Oryza sativa, Populus trichocarpa, Cucumis sativus, Mimulus guttatus, Selaginella moellendorffii, Physcomitrella patens and Chlamydomonas reinhardtii were searched for putative UGTs through the Phytozome v5.0 database (http://www.phytozome.net). The Malus domestica putative UGTs were identified in the genome v. 1.0 sequenced and assembled at IASMA-FEM (http://www.applegenome.org) (Velasco et al., 2010) and, at the time of writing, available on the Genome Database for Rosaceae (http://www.rosaceae.org). The A. thaliana UGT protein sequences were obtained from the A. thaliana cytochrome P450, cytochrome b5, P450 reductase, β-glucosidase and glycosyltransferase site (http://www.p450.kvl.dk).

The strategy adopted in this study to retrieve all the putative UGT sequences from the different plant species relied on protein BLAST (BLASTP) searches using the 44-amino acid PSPG motif. PSPG sequences from UGTs annotated in the CAZy database (http://www.cazy.org) were used to identify the multigene family in the plant of origin. Therefore, we used the PSPG motifs of the following UGTs: VvGT1 for V. vinifera, UGT706D1 for O. sativa, GmGT1 for G. max, UGT88F1 for M. domestica and UGT85B1 for S. bicolor. For BLASTP searches in the remaining plant genomes, we used the signature motif of A. thaliana UGT72B3 for a first screen, followed by a second search using the signature motif of the best hit. BLASTP searches were performed using the BLOSUM62 matrix and gapped alignments. In order to obtain a more comprehensive set of genes, the Phytozome database was also searched for genes annotated as ‘glycosyltransferases’, using the keyword search. The hits obtained from the two searches were then combined and the redundant sequences were removed. We considered that further analyses should be carried out on the sequences before using them for alignment and phylogeny reconstruction. In fact, in our experience, BLASTP and HMMER-based sequence similarity approaches applied to whole-genome searches are prone to identifying implausible proteins that should be removed; for instance, truncated proteins or very large protein sequences generated by incorrect splice-site recognition during the gene annotation process.
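The step of combining the two hit lists and removing redundant sequences can be sketched as follows. The text does not say how redundancy was defined, so this sketch assumes that a repeated identifier or an identical amino acid sequence counts as redundant; the function name, identifiers and sequences are hypothetical.

```python
def merge_hits(blast_hits, keyword_hits):
    """Combine two {identifier: sequence} dictionaries, skipping repeated
    identifiers and identical sequences (assumed redundancy criterion)."""
    merged = {}
    seen_sequences = set()
    for hits in (blast_hits, keyword_hits):
        for seq_id, seq in hits.items():
            # Normalize case and drop a trailing stop symbol before comparing.
            normalized = seq.upper().rstrip("*")
            if seq_id in merged or normalized in seen_sequences:
                continue
            merged[seq_id] = normalized
            seen_sequences.add(normalized)
    return merged

# Hypothetical hits from the motif-based search and the keyword search.
blast_hits = {"g1": "MKT", "g2": "MAA"}
keyword_hits = {"g2": "MAA", "g3": "mkt*", "g4": "MSS"}
combined = merge_hits(blast_hits, keyword_hits)
```

Here `g2` is dropped as a repeated identifier and `g3` as a duplicate of the `g1` sequence, so three unique entries remain.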

Putative sterol/lipid UGTs were identified in the genomes of the selected species by BLASTP search using the amino acid sequences of A. thaliana UGT80B1 and UGT81A1.

Analysis using TargetP (Emanuelsson et al., 2007) was performed to identify sequences carrying a signal peptide for subcellular localization.

Phylogeny reconstruction

For phylogeny reconstruction, the downloaded sequences were aligned all together using Muscle v3.8.31 software (Edgar, 2004). After an initial alignment that included all the sequences, we identified a large number of sequences that were too divergent (probably due to frameshift errors), too short (truncated) or too long, or that contained stop codons. These were removed from the input file. We then re-aligned the remaining sequences to obtain a second alignment of better quality, which was also subjected to visual inspection and sequence removal. The file obtained after elimination of dissimilar sequences included only the sequences that contained the conserved UGT domains, which could be well aligned among the whole set of operational taxonomic units (OTUs). This file was then aligned using Muscle v3.8.31 and edited manually using the Seaview alignment editor to discard all regions where homology of aligned positions was not obvious. With this approach, pseudogenes, characterized by frameshift errors and by the presence of stop codons, were excluded from the analysis.
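A crude automated stand-in for part of this curation (removal of truncated, over-long and stop-codon-containing sequences) might look like the sketch below. The study relied on alignment and visual inspection rather than fixed cut-offs, so the length ratios relative to the median, the function name and the test sequences are all assumptions made for illustration.

```python
from statistics import median

def filter_sequences(seqs, min_ratio=0.7, max_ratio=1.5):
    """Split {identifier: protein sequence} into kept and removed entries.
    Length limits are expressed relative to the median length (assumed cut-offs)."""
    typical = median(len(s.rstrip("*")) for s in seqs.values())
    kept, removed = {}, {}
    for seq_id, seq in seqs.items():
        body = seq.rstrip("*")  # a terminal stop symbol is tolerated
        if "*" in body:
            removed[seq_id] = "internal stop codon"
        elif len(body) < min_ratio * typical:
            removed[seq_id] = "too short"
        elif len(body) > max_ratio * typical:
            removed[seq_id] = "too long"
        else:
            kept[seq_id] = body
    return kept, removed

# Hypothetical sequences built from repeated residues, for illustration only.
seqs = {
    "a": "M" + "A" * 99,
    "b": "M" + "A" * 99 + "*",
    "c": "M" + "A" * 20,
    "d": "M" + "A" * 299,
    "e": "M" + "A" * 49 + "*" + "A" * 50,
}
kept, removed = filter_sequences(seqs)
```

A filter of this kind only flags obvious outliers; sequences that are too divergent to align would still need to be identified from the alignment itself.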

The final alignment was 278 positions long and included 1520 putative UGT sequences (Table S1). Phylogenetic analyses by the ML method were performed employing the fast, improved hill-climbing algorithm implemented in RAxML v. 7.0.4 (Stamatakis et al., 2007). The fixed amino acid frequencies model (JTT matrix) with in-run optimization of the proportion of invariable sites and the categorized gamma distribution of rates across sites (four categories) was used. With this model the analysis ran for more than 10 days. To obtain statistical support for branches, the ‘rapid’ bootstrap analysis option (-f a) was chosen to generate 100 non-parametric bootstrap replicates. The phylogenetic tree was visualized using the program Dendroscope v.2.7.4 (Huson et al., 2007). Alignments were visualized using the multiple alignment editor Jalview 2 (Waterhouse et al., 2009). Logos for the PSPG motif of each of the phylogenetic groups were generated using the WebLogo application available at http://weblogo.berkeley.edu (Crooks et al., 2004).
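For orientation, the settings described above could correspond to a RAxML invocation along the lines of the sketch below, which only assembles and prints a command without running it. Only the `-f a` option is given in the text; the model string `PROTGAMMAIJTT` is our reading of "JTT matrix, fixed frequencies, invariable sites, gamma rates", and the executable name, seed, file name and run name are placeholders. Other RAxML versions may require additional options.

```python
import shlex

def raxml_command(alignment, run_name, replicates=100, seed=12345,
                  binary="raxmlHPC"):
    """Assemble (but do not run) a RAxML rapid-bootstrap command line.
    All argument values other than '-f a' are assumptions, not from the text."""
    return [
        binary,
        "-f", "a",              # rapid bootstrap plus ML tree search
        "-x", str(seed),        # random seed for rapid bootstrapping (placeholder)
        "-#", str(replicates),  # number of bootstrap replicates
        "-m", "PROTGAMMAIJTT",  # assumed: JTT, gamma rates, invariable sites
        "-s", alignment,        # input alignment file (placeholder name)
        "-n", run_name,         # suffix for output files (placeholder name)
    ]

cmd = raxml_command("ugt_alignment.phy", "ugt_ml")
print(shlex.join(cmd))
```

Printing the command rather than executing it keeps the sketch independent of any local RAxML installation.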


This research was supported by the Autonomous Province of Trento, Italy, ‘ADP2010’ Project.