Complex evolutionary history and diverse domain organization of SET proteins suggest divergent regulatory interactions


  • Liangsheng Zhang,

    1. State Key Laboratory of Genetic Engineering, Institute of Plant Biology, Center for Evolutionary Biology, School of Life Sciences, Fudan University, Shanghai 200433, China
  • Hong Ma

    1. State Key Laboratory of Genetic Engineering, Institute of Plant Biology, Center for Evolutionary Biology, School of Life Sciences, Fudan University, Shanghai 200433, China
    2. Institutes of Biomedical Sciences, Fudan University, Shanghai 200032, China

Author for correspondence:
Hong Ma
Tel: +86 21 65642800


  • Plants and animals possess very different developmental processes, yet share conserved epigenetic regulatory mechanisms, such as histone modifications. One of the most important forms of histone modification is methylation on lysine residues of the tails, carried out by members of the SET protein family, which are widespread in eukaryotes.
  • We analyzed molecular evolution by comparative genomics and phylogenetics of the SET genes from plant and animal genomes, grouping SET genes into several subfamilies and uncovering numerous gene duplications, particularly in the Suv, Ash, Trx and E(z) subfamilies.
  • Domain organizations differ between different subfamilies and between plant and animal SET proteins in some subfamilies, and support the grouping of SET genes into seven main subfamilies, suggesting that SET proteins have acquired distinctive regulatory interactions during evolution. We detected evidence for independent evolution of domain organization in different lineages, including recruitment of new domains following some duplications.
  • More recent duplications in both vertebrates and land plants are probably the result of whole-genome or segmental duplications. The evolution of the SET gene family shows that gene duplications caused by segmental duplications and other mechanisms have probably contributed to the complexity of epigenetic regulation, providing insights into the evolution of the regulation of chromatin structure.


Gene duplication is a fundamental source of new genes during evolution and can provide novel opportunities for evolutionary success (Ohno, 1970; Van de Peer et al., 2009). In addition, whole-genome duplications (WGDs) can simultaneously generate a large number of duplicates and have been documented in animals, plants and fungi (Van de Peer et al., 2009). Vertebrates, many insects and most flowering plants are either polyploid or have a polyploid history, but polyploidy is much rarer in animals than in plants (Dehal & Boore, 2005; Maere et al., 2005; Van de Peer et al., 2009). The analysis of the Hox gene clusters supports the idea that vertebrates experienced two rounds of genome duplications in their early evolution (the 2R hypothesis) (Dehal & Boore, 2005; Ravi et al., 2009). Evidence for WGDs has been detected in all sequenced angiosperms, including at least five rounds of WGDs in Arabidopsis thaliana (Maere et al., 2005; Jiao et al., 2011). WGDs at important diverging points in eukaryotic evolution and the resulting gene amplifications may have contributed to the appearance of lineage-specific novelties (Maere et al., 2005; Van de Peer et al., 2009).

Epigenetic changes in chromatin structure define gene expression potential during development and are controlled by complex DNA and histone modification systems, including modifications of N-terminal histone tails, such as acetylation, methylation, phosphorylation and ubiquitination (Berger, 2007). These histone modifications show characteristic differences between transcriptionally active and repressed chromatin states (Berger, 2007). Histone methylation can occur at various lysine and arginine residues, including K4, K9, K27, K36 and K79 in histone H3 and K20 in histone H4. Methylation on H3K9, H3K27 and H4K20 is generally associated with silenced chromatin regions, whereas methylation on H3K4, H3K36 and H3K79 is associated with active genes (Berger, 2007). At least 24 sites of lysine and arginine methylation have been identified on H3, H4, H2A and H2B, making the number of different possible nucleosomal methylation states enormous. Lysine can be mono- (me1), di- (me2) or tri- (me3) methylated, whereas arginine can be mono- (me1) or di- (me2) methylated, further increasing the complexity of the epigenetic histone modification system.

All known histone lysine methyltransferases, except Dot1 for H3K79 methylation (Feng et al., 2002), share a highly conserved SET domain that contains the catalytically active sites and binds to the S-adenosyl-l-methionine cofactor (Dillon et al., 2005). The SET domain has c. 130 amino acids and was initially found in three Drosophila melanogaster proteins: Suppressor of variegation 3-9 (Su(var)3-9; or Suv) (Tschiersch et al., 1994), Enhancer of Zeste (E(z)) (Jones & Gelbart, 1993) and Trithorax (Trx) (Stassen et al., 1995). SET domain proteins have been classified into four subfamilies typified by their Drosophila homologs: E(z), Trx, Ash1(Ash) and Su(var)3-9 (Jenuwein et al., 1998). The Suv and E(z) subfamilies are mainly involved in gene repression, whereas members of the Trx and Ash subfamilies are positive regulators of targets, such as homeotic genes (Dillon et al., 2005).

To understand the relationship of genes encoding SET domain proteins (hereafter referred to as SET genes) and to obtain clues about their functional evolution, it is important to investigate their distribution and history among eukaryotes. Previous comparative analyses have indicated that the E(z), Trx, Ash and Suv subfamilies are evolutionarily conserved in eukaryotes (Baumbusch et al., 2001; Springer et al., 2003; Ng et al., 2007). However, the evolution of SET genes has generally been studied with a small number of representative species (Baumbusch et al., 2001; Springer et al., 2003; Alvarez-Venegas et al., 2007; Ng et al., 2007; Veerappan et al., 2008; Zhang et al., 2009; Huang et al., 2011; Zhu et al., 2011). Therefore, the pattern of conservation and diversification of SET genes across plants and animals remains largely uncertain. Here, we use comparative and phylogenetic approaches to investigate the evolutionary history of SET genes in many completely sequenced plant and animal genomes. Our analysis identified numerous gene duplication events in Suv, Ash, Trx and E(z) subfamilies and supports a model of the evolutionary history of the SET gene family.

Materials and Methods

Data sources

Sequences for the A. thaliana and rice SET genes and predicted proteins (Zhang et al., 2009) were retrieved from the TAIR9 Genome Release and the Rice Genome Annotation Project (V6.1), respectively. The complete genome and predicted proteome sequences of plants and animals were obtained from JGI and Phytozome v7.0, respectively. The Ensembl genome assemblies (version 55) of animals were additional sources for genome and predicted proteome sequences. The Neurospora crassa release 7 and some fungal genome sequences were retrieved from the Broad Institute and JGI. In proteome datasets, if two or more protein sequences at the same locus were identical where they overlapped, we selected the longest sequence.
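The last filtering step above (one representative protein per locus) can be sketched as follows. This is a minimal illustration, not the authors' script: the function name, the tuple layout and the locus identifiers are hypothetical, and the sketch simply keeps the longest sequence per locus without re-checking that overlapping regions are identical.

```python
def longest_per_locus(proteins):
    """Keep the longest protein sequence for each locus.

    proteins: iterable of (locus_id, protein_id, sequence) tuples.
    Returns a dict mapping locus_id -> (protein_id, sequence).
    Simplification: overlap identity between isoforms is NOT verified here.
    """
    best = {}
    for locus_id, protein_id, sequence in proteins:
        current = best.get(locus_id)
        if current is None or len(sequence) > len(current[1]):
            best[locus_id] = (protein_id, sequence)
    return best


# Hypothetical identifiers and toy sequences, for illustration only.
example = [
    ("LOCUS_A", "LOCUS_A.1", "MSETK"),
    ("LOCUS_A", "LOCUS_A.2", "MSETKLLV"),
    ("LOCUS_B", "LOCUS_B.1", "MAAA"),
]
representatives = longest_per_locus(example)
```

Ties are resolved in favor of the first sequence encountered, which is an arbitrary choice made for this sketch.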

Homolog searches

The hidden Markov model-based HMMER program (2.3.2) (Eddy, 1998) was used to identify all proteins containing a SET domain. The SET domain (PF00856 in Pfam database) (Finn et al., 2008) was used to perform local searches in the downloaded proteome datasets. These sequences were then manually adjusted in multiple sequence alignments, and the sequences with obvious errors were excluded from subsequent analyses. To find similar protein regions from unannotated genomic regions, the manually verified sequences (domains) were employed as queries for gene search using newly developed software, Phoenix (Protein Homologue Extraction; Y. Sun et al., The Pennsylvania State University, University Park, PA 16802, USA, unpublished; available on request), against nucleotide sequences.

Sequence alignment

Multiple sequence alignment using MUSCLE with the default parameters was performed with the SET protein sequences (Edgar, 2004) and adjusted manually. A preliminary neighbor-joining (NJ) tree was then generated using MEGA (Tamura et al., 2011) to determine the numbers and compositions of orthology groups. Each orthology group was aligned again separately and then combined using profile alignment in MUSCLE. Multiple sequence alignment of representative sequences generated by MUSCLE was used as reference for manual improvement. Other conserved regions in the SET proteins were screened against the SMART and Pfam databases. The alignments used here are available on request.

Phylogenetic and synteny analyses

The phylogenetic trees using SET protein sequences were constructed using the NJ, maximum likelihood (ML) and Bayesian methods. NJ trees were constructed using MEGA with the ‘pairwise deletion’ option and ‘Poisson correction’ model, with a bootstrap test of 1000 replicates. ML trees were generated using PHYML version 3.0.1 (Guindon & Gascuel, 2003) with 100 nonparametric bootstrap replicates and FastTree with the approximate likelihood ratio test (aLRT) method. MrBayes version 3.1.2 was used to construct Bayesian trees after running for 1 000 000 generations, with four Markov chains, and sampling every 1000 generations. Synteny was detected using the Synteny Database (Catchen et al., 2009) for animals and the Plant Genome Duplication Database (Tang et al., 2008a,b) for plants.
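For readers unfamiliar with the NJ settings named above, the following sketch shows how a Poisson-corrected distance with pairwise deletion is typically computed for one pair of aligned protein sequences: columns with a gap or missing character in either sequence are skipped for that pair only, the proportion of differing sites p is calculated, and the distance is d = -ln(1 - p). The function name and the set of characters treated as missing are assumptions of this sketch; it is not MEGA's implementation.

```python
import math

# Characters treated as gaps or missing data in this sketch (an assumption).
MISSING = {"-", "?"}


def poisson_distance(aligned_a, aligned_b):
    """Poisson-corrected amino acid distance with pairwise deletion."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must come from the same alignment")
    compared = 0
    differences = 0
    for a, b in zip(aligned_a, aligned_b):
        if a in MISSING or b in MISSING:
            continue  # pairwise deletion: skip this column for this pair only
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    p = differences / compared
    if p >= 1.0:
        raise ValueError("distance undefined when all compared sites differ")
    return -math.log(1.0 - p)


# Toy example: 5 comparable columns, 1 difference, so p = 0.2.
d = poisson_distance("MKTA-L", "MKSAQL")
```

A full distance matrix would apply this function to every pair of sequences before tree building.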

Expression analysis

RNA-Seq datasets of the following Arabidopsis tissues were mapped to the Arabidopsis TAIR9 genome annotation cDNA sequences with PerM (Chen et al., 2009): seedling (Filichkin et al., 2010); stage 4 flower (Jiao & Meyerowitz, 2010); stage 1–9 flower; stage 12 flower; and meiosis (Yang et al., 2011). Only the uniquely mapped sequence reads were used further. The gene expression levels were measured by RPKM (reads per kilobase of mRNA length per million mapped reads). Expression of SET duplicates was also obtained from Arabidopsis and human public microarray datasets (Su et al., 2004; Schmid et al., 2005) and RNA-Seq data of rice (Sakai et al., 2011). The Arabidopsis meiosis transcriptome and the human and rice meiosis microarray datasets (Chalmel et al., 2007; Tang et al., 2010) were used to estimate SET gene expression in meiosis.
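The RPKM measure defined above normalizes a read count by both transcript length (in kilobases) and sequencing depth (in millions of mapped reads). A minimal sketch of the calculation is given below; the function and argument names are hypothetical and the numbers are invented for illustration.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    RPKM = read_count / (length_in_kb * mapped_reads_in_millions)
         = read_count * 1e9 / (length_in_bp * total_mapped_reads)
    """
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


# Invented example: 500 uniquely mapped reads on a 2000-bp cDNA,
# in a library with 10 million uniquely mapped reads.
value = rpkm(500, 2000, 10_000_000)
```

With these invented numbers the transcript is 2 kb long and the library holds 10 million reads, so the result is 500 / (2 × 10) = 25 RPKM.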


Land plants and vertebrates have large numbers of SET genes

Many eukaryotic genomes have now been sequenced, resulting in a very large dataset for the identification of SET genes. We conducted a comprehensive search of SET genes in 165 sequenced genomes and identified 5536 SET genes (Supporting Information Table S1). We used 34 sequenced plant genomes and then selected other genomes for further analysis, including a representative set of 23 nonplant species (19 metazoans, three fungi and Monosiga brevicollis, a unicellular choanoflagellate; Table S1), allowing us to conduct an in-depth analysis of the evolutionary history of this gene family in plants and animals.

In an attempt to make the names of the SET genes less confusing, we have used a nomenclature system based on the previously reported Arabidopsis and human SET gene names (Baumbusch et al., 2001; Sun et al., 2008) and the phylogenetic relationships here. In each orthology group (defined as genes/proteins derived from a progenitor in the common ancestor of a group of organisms, in a similar manner to that used previously (Springer et al., 2003)), the Arabidopsis and human genes follow the standard gene symbol conventions with all capital letters; for genes from other plants and animals, the first letter is capitalized but the others are in lower case, using the same name as the Arabidopsis or human genes in the orthology group. For example, if a rice gene is in the same orthology group as the Arabidopsis CLF gene, it is named Clf. The specific number following the letters does not necessarily imply strict orthology, because of lineage-specific gene duplication and loss.
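The capitalization rule in this naming scheme can be expressed compactly. The sketch below is only an illustration of the rule as stated; the function name and the way species are identified are hypothetical, and it does not handle the numbering of lineage-specific duplicates.

```python
# Species whose gene symbols stay fully capitalized under the scheme above.
REFERENCE_SPECIES = {"Arabidopsis thaliana", "Homo sapiens"}


def set_gene_name(reference_symbol, species):
    """Derive a gene name from the Arabidopsis or human symbol of its orthology group."""
    if not reference_symbol:
        raise ValueError("reference symbol must not be empty")
    if species in REFERENCE_SPECIES:
        return reference_symbol.upper()
    # Other plants and animals: first letter capitalized, the rest lower case.
    return reference_symbol[0].upper() + reference_symbol[1:].lower()


rice_name = set_gene_name("CLF", "Oryza sativa")
human_name = set_gene_name("EZH2", "Homo sapiens")
```

Here the rice ortholog of Arabidopsis CLF becomes Clf, whereas the human symbol EZH2 is left in capitals.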

Phylogenetic classification of Arabidopsis and human SET genes into different subfamilies

To analyze the phylogeny of SET genes, we first used Arabidopsis and humans to represent plants and animals, respectively, and identified 47 Arabidopsis and 50 human SET genes for phylogenetic analysis using ML, NJ and Bayesian methods. Although the ML and NJ analyses yielded low resolution for deep relationships, the Bayesian results allowed the classification of these genes into seven major groups (Suv, Ash, Trx, E(z), PRDM, SMYD and SETD; Figs 1, S1). In addition to the Suv, Ash, Trx and E(z) subfamilies with both plant and animal homologs (Springer et al., 2003; Veerappan et al., 2008), we identified the SMYD and SETD subfamilies, which also have plant and animal homologs with low levels (10–30%) of sequence identity in the SET domain (Fig. S2). However, the PRDM genes were previously thought to have originated in metazoans, expanded in vertebrates, with further duplication in primates (Fumasoni et al., 2007). Our phylogenetic results suggest that PRDM also originated in the common ancestor of plants and animals, but nonanimal members were lost or too divergent to be identified definitively (Supporting Information Fig. S1). The PRDM proteins have only 10–40% amino acid sequence identity in the SET domain, except between the recent primate duplicates (Fig. S2). The SETD and SMYD proteins share an insertion of c. 100–300 residues in the middle of the SET domain (i-SET). By contrast, the SET domains of the Suv, Ash, Trx and E(z) subfamily members exhibit > 30% sequence identity within the subfamilies (Fig. S2).
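The sequence identity percentages quoted above depend on how identity is counted. One common convention, assumed here and not necessarily the one used for Fig. S2, is to count identical residues over the alignment columns in which neither sequence has a gap. The function name and toy sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical residues over columns without gaps in either sequence."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must come from the same alignment")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # ignore columns with a gap in either sequence
        compared += 1
        if a == b:
            identical += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * identical / compared


# Toy example: 5 ungapped columns, 4 identical.
identity = percent_identity("MK-TAL", "MKQSAL")
```

Other denominators (alignment length, or the length of the shorter sequence) give different values, so the convention should be stated whenever such percentages are compared.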

Figure 1.

Phylogenetic analysis of Arabidopsis (At.) and human (Hs.) SET genes. The Bayesian tree based on an alignment of the SET domains is shown. For major nodes, maximum likelihood (ML) and neighbor-joining (NJ) bootstrap values above 60%, and Bayesian posterior probabilities > 0.90 are shown. –, NJ or ML bootstrap values below 60% or posterior probabilities below 0.90.

Previous studies have classified plant SET genes into four groups, namely the Su(var)3-9 (Suv) group (including Suv homologs (Suvh) and Suv-related genes (Suvr)), E(z) homologs, the Trx group (Trx homologs (ATX) and Trx-related genes (ATXR)) and the Ash1 group (Ash1 homologs and Ash1-related genes (ASHR)) (Baumbusch et al., 2001; Springer et al., 2003; Ng et al., 2007). However, our results showed that only the ATXR7 and ASHR3 genes are respective members of the Trx and Ash subfamilies as defined by the analysis using genes from Arabidopsis and humans, whereas most ATXR and ASHR genes in Arabidopsis are more closely related to members of SMYD (Fig. 1).

Phylogenetic analysis of plant genes reveals orthology groups and multiple SET gene duplications after the divergence of land plants

To further understand the evolution of SET genes in plants, we analyzed the phylogeny of 1597 SET genes in sequenced plant genomes (Table S1). Our analyses assigned most plant SET genes to six groups, named the Suv, Ash, Trx, E(z), SMYD and SETD subfamilies, the same as six of the seven subfamilies in Arabidopsis and humans (Figs S3–S9). We defined 33 (7, 4, 4, 1, 7 and 10) orthology groups in the six subfamilies (Suv, Ash, Trx, E(z), SMYD and SETD), respectively. Interestingly, 16 orthology groups (1, 2, 3, 1, 1 and 8 in Suv, Ash, Trx, E(z), SMYD and SETD subfamilies, respectively) had genes from chlorophytes. Most orthology groups in SMYD and SETD maintained single-copy genes in each species. Among the seven Suv orthology groups, four were Suvr genes, each of which included moss genes (Fig. S4), indicating that they originated before the diversification of land plants. The Suvr3 (S7) group also included some algal genes, suggesting that more than one Suvr gene might have existed before the split of land plants and green algae. In SMYD, most groups included moss genes, indicating that the groups originated at least as early as the ancestors of land plants (Fig. S8). Only the ATXR3 group included many algal genes. In summary, many algal SET genes were lost in the Suv and SMYD subfamilies. Most plant SETD orthology groups contained members from both land plants and green algae (Fig. S9), suggesting that they originated in the common ancestor of land plants and chlorophytes. In addition, the Trx subfamily had a new orthology group, named Atx6, with members from land plants, including mosses, grasses and a subgroup of eudicots (Fig. 2), but not Arabidopsis and Brassica rapa (Fig. S5), suggesting that it was lost in Brassicaceae.

Figure 2.

Neighbor-joining (NJ) trees showing the evolutionary relationship among the Suv, Ash, Trx and E(z) subfamilies from land plants and algae. The Suv and Ash trees were constructed using the SET domain plus PreSET and AWS domains, respectively, and Trx and E(z) trees using only the SET domain. Only NJ bootstrap values of at least 60% are shown. The names of species are given as in Table 1. Each duplicate gene pair was assigned a color and a line on the right.

To understand the evolution patterns of plant Suv, Ash, Trx and E(z) subfamilies, we performed phylogenetic analyses with members of these four subfamilies from representative land plants and chlorophytes, as shown in Fig. 2. The Suv, Ash, Trx and E(z) subfamilies had a number of recent clades of lineage-specific paralogs, including 14 (7, 2, 3 and 2 in each Suv, Ash, Trx and E(z)), seven (3, 1, 2 and 1), four (4, 0, 0 and 0) and nine (4, 1, 4 and 0) such clades in poplar, Arabidopsis, rice and moss, respectively, revealing numerous gene duplication events (Fig. 2). The phylogenies of individual subfamilies with additional plant species (Figs S3–S7) uncovered many lineage-specific duplications in the Suv (see the next paragraph) and Trx subfamilies, and fewer events in the Ash and E(z) subfamilies. In the Trx and Ash subfamilies, in addition to poplar, soybean had many duplication events. In addition, two Brassicaceae-wide events were found in T1 and T2 and another shared by the grasses (Poaceae) in A3.

The plant Suv SET genes form two clades, named Suvh and Suvr, which can be subdivided into four (S1–S4; Fig. S3) and three (S5–S7; Fig. S4) orthology groups, respectively. As mentioned above, during angiosperm history, there were many lineage-specific duplications, including duplications shared by members of the Brassicaceae found in the S1, S2, S3 and S7 orthogroups; similarly, duplications in S1, S2, S3 and S7 were detected before the divergence of the grasses. Additional duplications were found in poplar, soybean and other species. It was noted that the Suvh SET genes were amplified in angiosperms and experienced retrotransposition-like events before the monocot–eudicot split (Baumbusch et al., 2001; Springer et al., 2003), but there were fewer copies in the nonseed plants Selaginella and moss. The Suvh genes also have two distinct gene structures: one with many small exons and the other with one or a few large exons (Fig. S10). The S1, S2 and S3 genes lack any introns, whereas many introns were detected in S4 genes found in flowering plants, moss and Selaginella (Fig. S10), suggesting that a retrotransposition-like event occurred before the divergence of land plants. We hypothesize that the ancestral Suvh SET gene in the early land plants had a structure similar to that of S4 genes; then, a retrotransposition-like event occurred, with subsequent copy number maintenance in Selaginella and moss and duplications in angiosperms. The angiosperm-specific duplications resulted in orthology groups S1–S3, as these do not have genes in Selaginella and moss (Fig. S3). The Suvh clade also includes a few algal genes (Fig. S3), and the predicted algal proteins have a SRA domain and similar domain structures to other Suvh proteins (data not shown), suggesting that the Suvh genes had an early origin before the divergence of land plants and green algae.

Phylogenetic analysis of vertebrate and invertebrate genes reveals orthology groups

As a comparison, we analyzed the phylogeny of 727 SET genes in 19 vertebrate and invertebrate genomes (Tables S1, S2). The results indicated that there are seven major well-supported phylogenetic clades (Figs S11, S12), corresponding to the seven subfamilies described in Fig. 1. Based on the tree topology and support values, we defined 31 orthology groups in the seven subfamilies (4, 3, 3, 1, 8, 9 and 3 groups in the Suv, Ash, Trx, E(z), PRDM, SMYD and SETD subfamilies, respectively). Among the 31 orthology groups, 26 included both invertebrate and vertebrate genes, suggesting that they originated in the common ancestors of invertebrates and vertebrates. By contrast, the remaining five groups only contained vertebrate genes, with three and two groups in the PRDM and SMYD subfamilies, respectively (Figs S11, S12). The phylogenetic topology of the PRDM subfamily (Fig. S12b) suggests that some or all of the groups with only vertebrate members also originated in early metazoans, but the invertebrate members were either lost or became too divergent. Alternatively, the vertebrate-only PRDM groups could also have been the result of expansion during vertebrate evolution, as proposed by a previous study (Fumasoni et al., 2007).

The SMYD subfamily can be divided into nine orthology groups, named SMYD1–SMYD5, SUV4-20, SETD7, SETD8 and MLL5 (Fig. S12a). The SMYD1 and SMYD2 groups include only vertebrate members, whereas the others have both vertebrate and invertebrate genes. The SMYD1–SMYD5 groups form a monophyletic group with high support values (ML: 88), suggesting that the SMYD1–SMYD5 genes originated from an ancestral gene in the common ancestor of vertebrates and invertebrates. The SMYD1–SMYD3 groups are each well supported (ML: 81–100) and form a large clade (ML: 99) sister to the SMYD4 and SMYD5 groups, suggesting that the SMYD1–SMYD3 groups originated from at least one ancestral gene in the common ancestor of vertebrates and invertebrates, and might have expanded in vertebrates or been lost in invertebrates.

Duplication of animal and plant SET genes as a result of genome duplications

From the phylogenetic analyses of animal SET genes, most orthology groups have a single copy in invertebrates, but two or more copies in vertebrates, except for the ASH1 and SETD2 orthology groups in the Ash subfamily and the SETMAR orthology group in the Suv subfamily (Figs 3, S11). The vertebrate sister orthology groups have similar topologies and were assigned numbers 1 and 2. Although some orthologs were lost in specific vertebrate lineages, the general patterns of SET genes were similar in mammals, chicken, frog and zebrafish, indicating that the corresponding duplication events probably occurred in the common ancestors of vertebrates.

Figure 3.

Neighbor-joining trees showing the evolutionary relationship of the Suv, Ash, Trx and E(z) subfamilies from animals and the protist Monosiga brevicollis. Tree construction and bootstrap values are given as in Fig. 2.

One possibility is that the duplicate vertebrate SET genes were generated by genome-wide duplications. To test this idea, we searched for possible synteny in the genomic regions containing the SET genes in invertebrates and vertebrates, by analyzing the genes flanking SET genes in humans and the urochordate Ciona intestinalis, also called the sea squirt, and found that genes near SET genes are members of other gene families, such as Notch genes, FGFR genes and Hox genes (Fig. 4). Up to four members in each family were found in syntenic positions, consistent with two large-scale genome duplications (WGDs) in early vertebrate evolution (Itoh & Ornitz, 2004; Theodosiou et al., 2009). The syntenic relationship of the genomic regions with SET genes and members of other gene families suggests that these SET genes also arose by large-scale genome duplications in vertebrates, with at least two maintained. Some orthology groups have three or four copies in Fugu and zebrafish, and even more in PRDM, SETD and SMYD subfamilies (Figs 3, S11, S12), more than the numbers in humans and mice. The increased numbers of paralogs in zebrafish are consistent with a reported WGD in teleosts (Sun et al., 2008). These results support the hypothesis that the ancestral animal had at least 26 ancestral SET genes, which were further increased in vertebrates, with most duplicates retained in the Suv, Ash, Trx and E(z) subfamilies. Furthermore, most of these expansions were a result of genome duplications in vertebrates or in fishes, illustrating the effect of WGD on the complexity of the histone methylation system.

Figure 4.

Schematic representation of the chromosomal position of some SET genes and of their neighboring genes in humans (Hs) and the urochordate Ciona intestinalis (Ci). (a) EHMT1 and EHMT2. (b) NSD1, WHSC1 and WHSC1L1. (c) EZH1, EZH2, MLL2 and MLL3. SET genes are indicated by red boxes; other homologous genes are also highlighted in the same color and are designated by similar names, such as TUBB4, TUBB and TUBB2C. The human and Ciona intestinalis chromosome names that contain these regions are indicated on the right.

Most orthology groups in the SMYD and SETD subfamilies have one copy in each species, but land plants carry more copies than green algae in the Suv, Ash, Trx and E(z) subfamilies (Tables 1, S1). Therefore, land plant SET genes experienced significant expansion in these four subfamilies, especially the Suv subfamily, the largest subfamily in land plants (Table 1). The analysis of SET genes in Arabidopsis and rice indicated that five pairs of SET genes are located in five different large genomic duplications in Arabidopsis and rice, respectively (Zhang et al., 2009). Compared with the papaya genes, the number of Suv genes increased in Brassicaceae and Poaceae (grasses) (Figs S3, S4), consistent with proposed separate genome duplications before the divergence of Brassicaceae or Poaceae. Furthermore, in B. rapa, there are many small clades, each containing two B. rapa genes (gene pair) and a single Arabidopsis gene (Fig. S13a). We found that, for seven gene pairs, the recent paralogs were located in syntenic regions in the B. rapa genome (Fig. S13b), indicating that the duplicates were probably a result of a recent genome duplication in B. rapa. Similar situations were also observed in poplar, which has more gene pairs (14/14) than Arabidopsis and rice located in large genomic duplication regions (data not shown). Therefore, segmental/genome duplication was a major reason for the expansion of plant SET genes (particularly Suv genes), similar to the expansion of SET genes in vertebrates. Although the genome duplications occurred independently in plants and animals, the observation that many SET duplicates have been retained in both land plants and vertebrates suggests that an increasingly complex histone methylation system might have contributed to the evolution of sophisticated gene regulation mechanisms underlying the morphological and physiological complexities in these parallel dominant forms.

Table 1.   Number of SET genes in plants and animals

Taxonomy          Species name                     Abbreviation   Suv  Ash  Trx  E(z)  SMYD  SETD  PRDM  All*
Angiosperms       Arabidopsis thaliana             At              15    5    6     3     8    10     0    47
Angiosperms       Populus trichocarpa (poplar)     Po              18    6    9     4    11    11     0    59
Angiosperms       Oryza sativa (rice)              Os              15    5    5     2     8    10     0    45
Bryophytes        Physcomitrella patens (moss)     Pp              11    4   10     1     7    12     0    45
Chlorophytes      Volvox carteri f. nagariensis    Vc               2    2    4     1     1     7     0    17
Chlorophytes      Chlamydomonas reinhardtii        Cr               2    1    4     0     1     5     0    13
Vertebrates       Homo sapiens (human)             Hs               7    5    6     2    11     3    16    50
Vertebrates       Mus musculus (mouse)             Mu               7    5    6     2    11     3    15    49
Vertebrates       Gallus gallus (chicken)          Ga               5    5    3     2    11     2    11    39
Vertebrates       Xenopus tropicalis (frog)        Xe               6    5    6     1     7     3     9    37
Vertebrates       Takifugu rubripes (fugu)         Fu               7    6    9     2    12     3    16    55
Vertebrates       Danio rerio (zebrafish)          Dr               9    6   10     2    17     4    12    60
Invertebrates     Ciona intestinalis               Ci               4    3    3     1     5     3     2    21
Invertebrates     Branchiostoma floridae           Bf               4    5    4     2    11     1    19    46
Invertebrates     Sea urchin                       Se               6    3    4     1     5     4     4    27
Invertebrates     Caenorhabditis elegans           Ce              12    2    2     1     4     3     2    26
Invertebrates     Drosophila melanogaster          Fl               4    3    3     1     6     2     1    20
Invertebrates     Nematostella vectensis           Nv               3    2    5     1     6     3     8    28
Choanoflagellate  Monosiga brevicollis             Mo               2    3    2     1     4     2     0    14

*These counts include only the SET domain sequences that can be aligned, not the highly divergent sequences. The SMYD family includes ATXR3, 5 and 6 members in plants and MLL5, SETD5 and SETD7/8 members in animals.

The early evolutionary history of the Suv, Ash, Trx and E(z) subfamilies before the divergence of animals and plants

The Suv, Ash, Trx and E(z) subfamilies were present before the divergence of animals and plants. To investigate the early evolution of these subfamilies, we carried out phylogenetic analyses with human, fly, Arabidopsis, N. crassa and yeast SET genes. These genes formed four major groups with high bootstrap support (98/79/97), consistent with the presence or absence of group-specific domains (Fig. 5). The tree topology indicates that each subfamily has multiple copies in animals and plants, as well as at least one copy in the multicellular fungus N. crassa; therefore, we concluded that each of the Suv, Ash, Trx and E(z) subfamilies originated from at least one ancestral gene in the most recent common ancestor (MRCA) of animals, fungi and plants (Fig. 6). In particular, the Trx subfamily has three clades, two of which have high bootstrap values (96 and 88), and each contains animal and plant genes, suggesting the presence of three ancestral Trx genes in the MRCA of animals, fungi and plants. Similarly, the Ash and Suv subfamilies each had at least three members before the divergence of plants, fungi and animals (Fig. 6).

Figure 5.

A neighbor-joining (NJ) tree showing the evolutionary relationship and domain architectures among Suv, Ash, Trx and E(z) from humans (Hs), fly (Fly), Arabidopsis (At), Neurospora crassa (Nc) and the budding yeast (Yeast). (a) The tree was constructed using the SET domain. Only NJ bootstrap values of at least 60% are shown, unless the difference in the values between the NJ and maximum likelihood (ML) trees is 5% or more. (b) The domain architectures of the full-length proteins were drawn based on a search of SMART and Pfam (including Pfam-A and Pfam-B) databases. Most domain names and diagrams were from SMART. The domains found using the Pfam-A and Pfam-B databases are denoted by ‘PA’ and ‘PB’, respectively. The name of each domain is indicated above or below the first occurrence. The asterisks denote that the pairs of protein domain architectures are different. (c) Known histone methyltransferase functions.

Figure 6.

A model for the evolutionary history of Suv, Ash, Trx and E(z) subfamilies. There was at least one copy of Suv, Ash, Trx and E(z) genes in the early stage of eukaryotes. In the Suv, Ash and Trx subfamilies, the early expansion occurred before or just after the divergence of plants and animals. Many duplication events occurred independently after the divergence of plants and animals in vertebrates and land plants, most of which may have been the result of whole-genome (WGD) or segmental duplications, representing recent expansion. Possible gene losses are shown in fungi and algae. GD, gene duplication; GL, gene loss. Dr, zebrafish; Hs, humans; Nc, Neurospora crassa; At, Arabidopsis; Po, poplar; Os, rice; Pp, Physcomitrella patens (moss).

Rapid evolution of the animal PRDM and plant Suv subfamilies

It is thought that different gene families experience different selective forces because of their functions, and exhibit differential evolution rates (Vogel & Chothia, 2006; Jin et al., 2009). Similarly, different subfamilies of a gene family can also show distinct rates of evolution, especially for those having multiple functions, such as the MADS-box or F-box genes (Nam et al., 2004; Xu et al., 2009). The levels of sequence similarity and the tree branch lengths of SET genes suggest that the members of different SET subfamilies had different overall evolutionary rates during animal and plant histories. To examine the rate of evolution over more recent times, we analyzed the dN/dS ratios between orthologs from two closely related animal species, humans and chimpanzees, and from two plant species, A. thaliana and A. lyrata, separated by 5 and 10 million years, respectively. Orthologs were detected as reciprocal best hits using all-by-all BLASTp searches. Our analysis of dN/dS values between A. thaliana and A. lyrata orthologs indicates that the Suv genes clearly evolved faster than the other subfamilies (Fig. 7a; Table S3). In animals, the PRDM subfamily members evolved more rapidly than most members of the other subfamilies (Fig. 7b; Table S3). One of the members, PRDM9, is involved in binding to meiotic recombination hotspots and in the initiation of recombination (Baudat et al., 2010; Parvanov et al., 2010).
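The reciprocal-best-hit criterion used above for ortholog detection can be sketched as follows. A pair (a, b) is retained only when b is the top-scoring hit of a in one search direction and a is the top-scoring hit of b in the other. The sketch assumes that hits have already been parsed into (query, subject, score) tuples and that hits are ranked by bit score; both the data layout and the ranking criterion are assumptions, and the gene identifiers are invented.

```python
def best_hits(hits):
    """Return the top-scoring subject for each query.

    hits: iterable of (query, subject, score) tuples; higher score is better.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where each gene is the other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Invented identifiers and scores, for illustration only.
thaliana_vs_lyrata = [
    ("Ath_g1", "Aly_g1", 950.0),
    ("Ath_g1", "Aly_g2", 610.0),
    ("Ath_g2", "Aly_g2", 700.0),
    ("Ath_g3", "Aly_g2", 720.0),
]
lyrata_vs_thaliana = [
    ("Aly_g1", "Ath_g1", 940.0),
    ("Aly_g2", "Ath_g3", 715.0),
    ("Aly_g2", "Ath_g2", 690.0),
]
orthologs = reciprocal_best_hits(thaliana_vs_lyrata, lyrata_vs_thaliana)
```

In this toy data set, Ath_g2 has Aly_g2 as its best hit, but Aly_g2 prefers Ath_g3, so Ath_g2 is left without an ortholog; recent lineage-specific duplicates are excluded by this criterion in the same way.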

Figure 7.

Rates of synonymous (dS) and nonsynonymous (dN) substitutions between orthologous SET genes. For easy visualization, we plotted dS on the x-axis and dN/dS on the y-axis. (a) Arabidopsis thaliana and A. lyrata. *, SUVH8 has two paralogs in A. lyrata. (b) Humans and chimpanzees.

Diverse domain organizations of SET proteins

To investigate the SET protein domain organizations in the context of their phylogeny, we used SMART and Pfam databases to perform protein domain prediction. The results showed that SET proteins, particularly those in the Ash, Trx and Suv subfamilies, contain many other domains, including PHD, PWWP, MBD, AWS, PreSET and SRA (Fig. 5b). We also identified a number of unknown domains in Pfam-B databases, such as the PB014572 domain found in the ASHH2 protein (Fig. 5b). A previous study and our results suggest that SET proteins in one subfamily usually have one or more characteristic domains that are different from those of other subfamilies (Fig. 5b). The conservation of several domains in SET domain proteins of both animals and plants suggests that they are important for the function of the SET domain proteins in the respective subfamilies.

In addition to generally similar domain organization for each subfamily, animal and plant members of the same subfamily sometimes have distinct domains, including some domains that are unique to plants or animals (Fig. S14). For example, in the Suv subfamily, SRA and WIYLD were found only in plant SET proteins, whereas MBD, ANK, CHROMO and TUDOR only existed in animals (Fig. 5b). Similar situations were also observed for the Ash and Trx subfamilies (Fig. 5b). Compared with plant SET proteins, animal SET proteins contain more domains and have more complex domain organizations (Fig. 5b).
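The comparison of lineage-specific domains described above amounts to simple set operations. The sketch below uses only the Suv subfamily domains named in the text, plus the SET domain itself; it is not a complete domain inventory, and the function names are illustrative.

```python
from typing import Dict, Set

# Domains named in the text for the Suv subfamily only (incomplete by design).
SUV_DOMAINS: Dict[str, Set[str]] = {
    "plant": {"SET", "SRA", "WIYLD"},
    "animal": {"SET", "MBD", "ANK", "CHROMO", "TUDOR"},
}


def lineage_specific(domains: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """For each lineage, return the domains found in no other lineage."""
    specific = {}
    for lineage, own in domains.items():
        others: Set[str] = set()
        for other_lineage, other_domains in domains.items():
            if other_lineage != lineage:
                others |= other_domains
        specific[lineage] = own - others
    return specific


def shared(domains: Dict[str, Set[str]]) -> Set[str]:
    """Return the domains present in every lineage."""
    return set.intersection(*domains.values())


specific = lineage_specific(SUV_DOMAINS)
common = shared(SUV_DOMAINS)
```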

Both SETD and SMYD proteins have low levels of sequence identity in the SET domains; however, members of the same subfamily have the same domain organization (Fig. 8), supporting the hypothesis that the SETD and SMYD subfamilies each originated from the same ancestor before animal and plant divergence. The SETD proteins, but not other SET proteins, contain a Rubisco LSMT substrate-binding (Rubis-subs-bind: PF09273, RSB) domain, which allows the protein to bind to the N-terminal tails of histones H3 and H4 (Trievel et al., 2003). The Rubis-subs-bind domain also has highly divergent sequences, with a Pfam e-value ranging from 3.30 to 3.80e-37 (Fig. 8b). The SMYD proteins have a conserved insertion domain (i-SET) (Fig. 8a), the MYND finger domain (zf-MYND:PF01753), which is found in a large number of proteins with important roles in normal development or cancers, and has been shown to mediate protein–protein interactions, mainly in the context of transcriptional regulation (Leinhart & Brown, 2011). The zf-MYND domain exists in SMYD proteins of both plants and animals (Fig. 8a), indicating that this domain was probably inserted before the divergence of animals and plants.

Figure 8.

The domain architectures of SETD and SMYD proteins from humans (Hs) and Arabidopsis (AT). (a) The domain architectures of the full-length proteins were drawn based on the Pfam database. (b) The e-value of each domain was obtained using a HMMER search against the Pfam database.

Expression of SET genes in meiosis and comparison between duplicates

In Arabidopsis, almost all SET genes were transcribed according to RNA-seq results from five different developmental stages and tissues (Table S4), except for SUVH7 and SUVH10, which may be pseudogenes. Meiosis is essential for sexual reproduction and may require epigenetic histone modification; to assess the possible function of specific SET genes, we examined the meiotic expression of SET genes in Arabidopsis, rice and humans (Fig. S15). The results indicate that members of the Suv, Ash, Trx and E(z) subfamilies are expressed in meiosis, but each orthology group shows different expression patterns, suggesting that distinct orthology groups may play meiotic roles in different organisms.

In addition, to obtain clues regarding possible functional differences between recent paralogs, we compared their gene expression patterns in Arabidopsis, rice and humans, using Affymetrix microarray data and Arabidopsis and rice RNA-seq data. Most paralogs in Arabidopsis, rice and humans showed one of two patterns: either similar expression levels across developmental stages and tissues, suggesting functional conservation, or significantly different expression, with one copy expressed at higher levels and the other at lower levels or not at all in all stages and tissues (Figs S16, S17), suggesting different functions or silencing of one copy.


A model for the evolutionary history of the Suv, Ash, Trx and E(z) subfamilies

Our phylogenetic analysis of the SET gene family in plants, animals and fungi allows us to propose a scenario of the overall evolutionary history of SET genes (Fig. 6). In this model, there were at least three genes in each of the Suv, Ash and Trx subfamilies and one E(z) gene, giving a total of 10 (3, 3, 3 and 1) genes, before the divergence of plants, animals and fungi. Subsequently, further duplication resulted in at least 14 SET genes (5, 4, 4 and 1 in the Suv, Ash, Trx and E(z) subfamilies, respectively) in the common ancestor of chlorophytes and land plants; parallel independent expansion in animals resulted in 11 SET genes (4, 3, 3 and 1 in the Suv, Ash, Trx and E(z) subfamilies, respectively) in the common ancestor of vertebrates and invertebrates. In addition to these four subfamilies, at least 11 SET genes in three other subfamilies existed before the divergence of vertebrates and invertebrates. Additional SET gene duplication occurred after the divergence of vertebrates and invertebrates, and after the separation of chlorophytes and land plants. In addition, some Suv, Ash and Trx genes were lost from fungi and algae. In animals, most orthology groups in each subfamily had two or three vertebrate genes and one invertebrate gene, and were supported by vertebrate syntenic evidence, strongly supporting the idea that many vertebrate SET genes are the result of genome duplication. Furthermore, numerous lineage-specific paralogs in land plants were probably also the result of WGD or segmental duplications. The independent yet similar duplication mechanisms in both vertebrate animals and land plants are striking, and suggest that such mechanisms are important for the evolution of SET genes.
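The minimum gene counts in this model can be tallied with a short sketch, which checks the stated totals and derives the net gain per subfamily along each lineage. The counts are those given in the text; the dictionary keys and function names are illustrative only.

```python
from typing import Dict, Tuple

SUBFAMILIES = ("Suv", "Ash", "Trx", "E(z)")

# Minimum inferred gene counts per subfamily, in the order of SUBFAMILIES.
MIN_COUNTS: Dict[str, Tuple[int, int, int, int]] = {
    "plant_animal_fungal_ancestor": (3, 3, 3, 1),
    "chlorophyte_land_plant_ancestor": (5, 4, 4, 1),
    "vertebrate_invertebrate_ancestor": (4, 3, 3, 1),
}


def total(node: str) -> int:
    """Total minimum number of genes in the four subfamilies at a node."""
    return sum(MIN_COUNTS[node])


def net_gain(ancestor: str, descendant: str) -> Dict[str, int]:
    """Net change in minimum gene count per subfamily between two nodes."""
    return {
        name: later - earlier
        for name, earlier, later in zip(
            SUBFAMILIES, MIN_COUNTS[ancestor], MIN_COUNTS[descendant]
        )
    }


plant_gain = net_gain("plant_animal_fungal_ancestor",
                      "chlorophyte_land_plant_ancestor")
animal_gain = net_gain("plant_animal_fungal_ancestor",
                       "vertebrate_invertebrate_ancestor")
```

Under these minimum counts, the plant lineage gains at least four genes (two Suv, one Ash and one Trx) and the animal lineage at least one Suv gene before the later lineage-specific expansions.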

This model presents an evolutionary history of the Suv, Ash, Trx and E(z) subfamilies in major eukaryotic lineages, and describes the extent and mechanisms of gene duplications in plants and animals. The number of genes varies both between and within major eukaryotic lineages, and has expanded significantly, probably as a result of genome duplication, providing strong support for an important role of genome duplication in shaping new gene functions and the characters they control. We suggest that such expansions might have also occurred in other gene families and may reflect a general theme in the evolution of gene repertoires of plants and animals. SET genes have been shown to play important roles in plant and animal development (Dambacher et al., 2010; Thorstensen et al., 2011). The expansion of SET genes by genome duplication during plant and animal evolution has probably facilitated the evolution of novel and complex traits and species diversity in both of these lineages. This study provides an excellent example supporting the general belief that genome duplication plays an important role in trait generation and speciation.

Our analysis shows that most orthology groups of SET genes were already present in early plants and animals, before or soon after the divergence of these major eukaryotic lineages. In addition, the fact that members of different subfamilies can methylate different sites on histone tails suggests that the early eukaryotes already had multiple histone methylation activities. The large range of multiple activities probably promoted the early evolution of plants, animals and fungi, which independently evolved from single cell forms into complex multicellular organisms. Further expansion of SET genes in land plants and vertebrates occurred when these organisms adapted to complex terrestrial environments, and the increased functional complexity of SET genes probably allowed versatile transcriptomic regulation that enabled adaptation to such environmental conditions. Similar evolutionary patterns were also found in families encoding transcription factors, such as the bHLH family, reflecting the functional diversification of early regulatory networks (Pires & Dolan, 2010).

Domain organization, gene expression and sequence evolution of recent duplicates

We found that many gene duplications occurred in the Suv, Ash, Trx and E(z) subfamilies, with potential changes in protein domain organization. Two main patterns following duplication were found: either one of the duplicates had an altered organization, or all duplicates retained the ancestral organization. The evolution of domain organization was more easily detected by comparing SET proteins between vertebrates and invertebrates, as most subgroups retained two or more copies in vertebrates, but had only one copy in invertebrates. Of eight sets of duplicates in humans, five (SETDB1/2, EHMT1/2, WHSC1/L1/NSD1, MLL2/3 and MLL/4) had the first pattern, whereas three (EZH1/2, SETD1A/B and SUV39H1/2) had the second pattern (Fig. 5b). These results suggest that duplicated SET genes might have either relatively conserved or divergent gene functions by retaining or changing domain organizations. The latter could result in distinct regulatory interactions via the binding of different non-SET domains to other proteins. The known functions of these genes (Fig. 5c) indicate that most of the recent duplicates have maintained their functions. Changes in the patterns of tissue-specific expression can also have a dramatic effect on the functions of paralogs. In a number of cases, duplicate genes have different expression patterns (Figs S16, S17), suggesting that the duplicated genes have diverged in function (subfunctionalization or neofunctionalization). In particular, functional divergence is supported by the finding that Ezh2 catalyzes H3K27me2/3, but Ezh1 performs this function weakly and targets a subset of Ezh2-related genes (Margueron et al., 2008). In summary, retention of the original biochemical activities may be the primary fate of duplicated SET genes.

In addition to domain organization and expression pattern, the dN/dS ratio can provide clues on functional evolution, particularly for relatively recent gene duplicates. In the closely related humans and chimpanzees, the PRDM subfamily has evolved more rapidly than the other six subfamilies. Specifically, the PRDM9 and PRDM7 genes may have evolved under positive selection (dN/dS = 1.3168 and 1.1716, respectively). In A. thaliana and A. lyrata, the Suv subfamily has evolved more rapidly than the other subfamilies. In addition, the MEA gene, which is essential for normal seed development, has evolved very rapidly (dN/dS = 0.7022; Fig. 7) and has been reported to have been under positive selection in A. thaliana and A. lyrata (Spillane et al., 2007). These results suggest that PRDM9, PRDM7 and MEA have undergone neofunctionalization (Spillane et al., 2007; Oliver et al., 2009). It is interesting that PRDM9 and MEA resulted, respectively, from a recent duplication in primates and from the WGD before the split of A. thaliana and A. lyrata. Moreover, both duplicates from the same duplication have sometimes undergone rapid evolution, such as SUV420H1 and SUV420H2 in animals and SUVH7–8 and SUVR1–2 in plants (Fig. 7). These results suggest that both SET duplicates can undergo rapid functional diversification in closely related species.
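The way in which these ratios are read can be sketched as follows. This is a simple heuristic on gene-wide pairwise dN/dS values, using 1 as the neutral expectation; it is not the statistical test used in the cited studies, and the function names and labels are illustrative. Only the three ratios reported in the text are included.

```python
# dN/dS ratios reported in the text (gene-wide, pairwise between species).
REPORTED_RATIOS = {"PRDM9": 1.3168, "PRDM7": 1.1716, "MEA": 0.7022}


def dn_ds(dn: float, ds: float) -> float:
    """Ratio of nonsynonymous to synonymous substitution rates."""
    if ds <= 0:
        raise ValueError("dS must be positive to compute dN/dS")
    return dn / ds


def interpret(ratio: float, neutral: float = 1.0) -> str:
    """Coarse reading of a gene-wide dN/dS ratio against the neutral value."""
    if ratio > neutral:
        return "above neutral: consistent with positive selection"
    if ratio < neutral:
        return "below neutral: purifying selection dominates on average"
    return "neutral expectation"


readings = {gene: interpret(ratio) for gene, ratio in REPORTED_RATIOS.items()}
```

A gene-wide ratio below 1 does not exclude positive selection on a subset of sites, which is why MEA, with dN/dS = 0.7022, can still show evidence of positive selection in a more detailed analysis (Spillane et al., 2007).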

Conservation and diversification of SET histone methylation activities

Depending on the site of modification, histone lysine methylation is associated with gene activation (H3K36 and H3K4) or silencing (H3K9 and H3K27) (Berger, 2007). Members of each of the Suv, Ash, Trx and E(z) subfamilies mainly modify one or two sites. In particular, members of the Trx and E(z) subfamilies specifically modify H3K4 and H3K27, respectively (Liu et al., 2010; Nimura et al., 2010; Thorstensen et al., 2011). Ash proteins mainly methylate at H3K36, but can also modify other sites, whereas Suv proteins primarily modify H3K9 (Fig. 5c), but can also modify other sites. For example, G9a histone methylation can occur at H3K9 and K27, and ASH1 modification can occur at H3K4, K9 and H4K20 (Dillon et al., 2005). Both Trx and E(z) subfamilies have evolved slowly in animals and plants (Figs 2, 3, 5a), and have conserved histone methylase functions (Liu et al., 2010). Assuming that highly conserved SET proteins shared similar biochemical activities, we can propose potential histone methylase activities for those members of each subfamily whose functions have not yet been analyzed. To date, all three members with reported activity in the Trx subfamily are from plants and have H3K4 methylase activity (Liu et al., 2010). Therefore, other Trx members from animals and other species might also have H3K4 methylase activity. In addition, E(z) subfamily members are known to have H3K27 methylase activity, suggesting that other E(z) proteins might be similar. The Suv and Ash subfamilies have multiple genes in plants and animals, and known activities include methylation of H3K9, H3K36, H3K4 and H3K27 (Fig. 5c), suggesting diversification of biochemical activities; nevertheless, members of the same orthogroup might still have similar activities.

Apart from the four main subfamilies, other SET genes have maintained stable numbers during animal and plant evolution (Table 1); they may have highly conserved functions and are worthy of further investigation, such as ATXR3 (Berr et al., 2010; Guo et al., 2010) and SETD3, which have lysine specificity towards H3K36 (Kim et al., 2011). In addition to histone methylation, the SET proteins can also modify other proteins, such as p53 and the NF-κB subunit RelA (Huang et al., 2006; Levy et al., 2011).

The various histone methyltransferase activities by members of different SET subfamilies can regulate gene expression, and thus development, in different ways, as suggested by previous studies (Dambacher et al., 2010; Thorstensen et al., 2011). For example, H3K9 methylation plays a role in DNA methylation-mediated gene silencing (Liu et al., 2010). The phylogenetic tree shows that, among the four subfamilies, the Suv subfamily differs the most between Arabidopsis and rice in terms of duplication and copy number, followed by Ash, Trx and E(z). In addition, the meiotic expression patterns of the Suv subfamily show the most differences between Arabidopsis and rice, compared with Ash, Trx and E(z). The different expression and evolutionary patterns of Suv imply that the function of Suv may have diverged to a greater extent between Arabidopsis and rice.

In yeast, the H3K4me3 mark deposited by SET1, which belongs to the Trx subfamily, is enriched at double-strand break (DSB) hotspots (Borde et al., 2009). In mice and humans, PRDM9, a member of the PRDM subfamily with H3K4me3 activity, is important for the regulation of meiotic recombination hotspots (Baudat et al., 2010; Parvanov et al., 2010). What are the possible roles of SET genes in plant meiosis? We investigated the Arabidopsis and rice SET gene expression in meiotic cells and found that most genes were expressed (Table S4). Interestingly, we found that some genes encoding putative methylases for H3K4me3 were expressed highly in Arabidopsis and rice meiotic cells. Furthermore, ATXR3, which has H3K4me3 activity, was expressed more highly in meiocytes than in other developmental stages and tissues (Table S4), suggesting that ATXR3 may play a preferential role in meiotic recombination. Indeed, most genes encoding H3K4me3 methylases are expressed in Arabidopsis and rice meiotic cells, suggesting that H3K4me3 may affect meiotic recombination and hotspots in plants. This may even be a general eukaryotic regulatory mechanism and may have originated in the ancestor of eukaryotes.


We thank Yingxiang Wang, Yaqiong Wang, Liping Zeng and Hongyan Shan for comments on the manuscript and helpful discussions. This work was supported by the Chinese Ministry of Science and Technology (2011CB944600), and by funds from Fudan University (211 and 985 programs).