Gene clusters for the synthesis of secondary metabolites are a common feature of microbial genomes. Well-known examples include clusters for the synthesis of antibiotics in actinomycetes, and also for the synthesis of antibiotics and toxins in filamentous fungi. Until recently it was thought that genes for plant metabolic pathways were not clustered, and this is certainly true in many cases; however, five plant secondary metabolic gene clusters have now been discovered, all of them implicated in synthesis of defence compounds. An obvious assumption might be that these eukaryotic gene clusters have arisen by horizontal gene transfer from microbes, but there is compelling evidence to indicate that this is not the case. This raises intriguing questions about how widespread such clusters are, what the significance of clustering is, why genes for some metabolic pathways are clustered and those for others are not, and how these clusters form. In answering these questions we may hope to learn more about mechanisms of genome plasticity and adaptive evolution in plants. It is noteworthy that for the five plant secondary metabolic gene clusters reported so far, the enzymes for the first committed steps all appear to have been recruited directly or indirectly from primary metabolic pathways involved in hormone synthesis. This may or may not turn out to be a common feature of plant secondary metabolic gene clusters as new clusters emerge.
The organisation of genes into functional groupings is a common feature of bacterial genomes, where genes are frequently found in operons (clusters of co-transcribed genes with related functions) (Jacob and Monod, 1961; Zheng et al., 2002; Rocha, 2008; Koonin, 2009). With a few noteworthy exceptions [for example, in the nematode Caenorhabditis elegans (Blumenthal and Gleason, 2003)], true operons are rare in eukaryotes. Although clusters of functionally related genes do exist in eukaryotes, most of these consist of paralogues that have evolved by repeated tandem gene duplication and divergence. Examples include the mammalian Hox and β-globin loci, which are required for the embryonic patterning and synthesis of haemoglobin, respectively, and clusters of leucine-rich repeat genes associated with disease resistance in plants (Osbourn and Field, 2009).
Functionally related genes are commonly dispersed throughout the genome in eukaryotes, and can be co-ordinately regulated in trans. Until recently it was thought that gene order in eukaryotes was essentially random; however, it is becoming increasingly clear that this is not the case (Hurst et al., 2004; Osbourn and Field, 2009), and that there are numerous examples of clusters of functionally related but non-homologous genes in eukaryotic genomes. Perhaps the best known example is the major histocompatibility complex (MHC) in mammals (Horton et al., 2004). However, the vast majority of clusters described so far are for secondary metabolic pathways from filamentous fungi (Hoffmeister and Keller, 2007; Turgeon and Bushley, 2010). More recently, gene clusters for the synthesis of secondary metabolites have emerged as a new and growing theme in plant biology (Figure 1). These clusters are amongst the most diverse and rapidly evolving features of plant genomes, and as such provide excellent ‘read-outs’ for investigating genome plasticity and the mechanisms of adaptive evolution. Here, we will review current knowledge of plant metabolic gene clusters and explore the likely significance of clustering.
Operons and Gene Clusters in Prokaryotes and Eukaryotes
In order to put plant gene clusters into an evolutionary genomics context, it is important first to consider what is already known about the nature of function-related gene ensembles in different organisms. We will therefore give a brief overview of the different types of gene clusters in organisms across the kingdoms, by way of setting the scene.
Clusters of non-homologous genes are common in bacteria, where genes are often organized in operons (Zheng et al., 2002; Rocha, 2008; Koonin, 2009). An operon consists of a group of genes that are adjacent in the genome and that contribute to a common function: for example, the ability to catabolize particular substrates, swim or produce secondary metabolites. The genes within each operon are under the control of the same promoter and are co-transcribed as a single polycistronic message. The first operon to be discovered was the lac operon, which was found in Escherichia coli by Jacob and Monod around 50 years ago (Jacob and Monod, 1961) (Figure 2). The lac operon contains a set of genes that enable E. coli to use the sugar lactose as a carbon source. The organization of functionally related genes into operons provides a mechanism that enables the genes within an operon to be co-expressed when the need arises (in this case, when E. coli needs to be able to use lactose as a carbon source in order to grow). Under conditions where lactose is not available or the levels of glucose are high, expression of the lac genes is shut down.
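The on/off behaviour of the lac genes described above can be summarized as a simple truth table. The following Python sketch is purely illustrative: the function name and the reduction of the inputs to two yes/no conditions are our own simplifying assumptions, and real lac regulation is quantitative rather than strictly binary.

```python
def lac_genes_expressed(lactose_available: bool, glucose_high: bool) -> bool:
    """Simplified qualitative view of lac operon expression.

    Hypothetical helper for illustration only: the lac genes are treated as
    'on' when lactose is available and glucose levels are not high, and as
    shut down otherwise.
    """
    return lactose_available and not glucose_high


# Print the four combinations of the two conditions.
for lactose in (False, True):
    for glucose in (False, True):
        state = "on" if lac_genes_expressed(lactose, glucose) else "off"
        print(f"lactose available={lactose!s:5}  glucose high={glucose!s:5}  lac genes {state}")
```

Only one of the four combinations (lactose available, glucose not high) leaves the lac genes switched on, which reflects the economy of co-expressing the whole gene set only when it is needed.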
Between 30 and 50% of genes in bacteria are organized in operons, indicating that this form of gene organization must confer strong evolutionary advantages (Zheng et al., 2002). Various hypotheses have been put forward regarding the origin and evolution of operons (e.g. Price et al., 2006; Fondi et al., 2009). Co-transcription of the genes within operons may favour lower gene expression noise and stoichiometric expression of the cognate gene products (Rocha, 2008; Kovács et al., 2009). The most highly conserved operons tend to code for components of protein complexes, although there are many exceptions to this rule (Dandekar et al., 1998; Pál and Hurst, 2004; Swain, 2004; Price et al., 2006). Spatial and temporal coupling of transcription and translation may also facilitate the co-ordination of post-translational processes such as compartmentalization and assembly of protein complexes through co-translational folding. In addition, physical clustering is expected to favour the fitness of a group of genes by facilitating co-inheritance. It is certainly the case that operons can be transmitted between bacteria by horizontal gene transfer, and this is proposed to be the primary origin of operons according to the selfish operon model (Lawrence and Roth, 1996; Lawrence, 1999). However, essential genes are commonly found in operons, as are other genes that are not known to be transmitted by horizontal gene transfer (Pál and Hurst, 2004; Rocha, 2008). The benefits of clustering for co-ordinate gene regulation may therefore provide a more likely explanation for the existence of operons (Price et al., 2006), although clearly co-inheritance (be it by vertical or horizontal transmission) will be critical for the persistence of the cluster in bacterial genomes.
Despite the large body of data available regarding structure, distribution and conservation, the biological significance and mechanism(s) of operon formation are still under debate (Fondi et al., 2009). Foreign genes and complete operons may be introduced into bacterial genomes by horizontal gene transfer. However, operons consisting of native genes can also form de novo. The establishment of a new operon requires the assembly of multiple genes at a single locus in the bacterial genome. New operons can arise as a result of genomic rearrangements that bring more distant genes into proximity, or by the deletion of intervening genes (Fani et al., 2005; Price et al., 2006). As operon formation brings functionally related genes together, it is unlikely to be a neutral process, but rather a consequence of strong selection. Operon formation also requires the establishment and maintenance of co-ordinated regulation of this new gene set. It has been argued that co-regulation could be achieved more readily by modifying two independent promoters rather than by placing two genes in proximity. However, in situations where there is a need for the complex regulation of groups of genes, an operon with one complex promoter may be expected to arise more easily than two independent promoters. The dependence of several genes on a single regulatory sequence is likely to put this sequence under stronger selection, thereby allowing for the emergence of more complex regulatory strategies. Operons do tend to have more complex conserved regulatory sequences compared with individually transcribed genes, consistent with this hypothesis (Hazkani-Covo and Graur, 2005; Price et al., 2005, 2006).
Most eukaryotic genes are transcribed separately, i.e. the mRNAs are monocistronic. Indeed, in the absence of the bacterial ploy of using complementarity between a Shine–Dalgarno sequence and the 3′ end of the 16S ribosomal RNA to recognize start codons, including those internal to transcripts, it was originally thought that operons could not exist in eukaryotes. However, this turns out to be an oversimplification. As many as 15% of the genes in the nematode C. elegans are transcribed as bi- or polycistronic mRNAs (Spieth et al., 1993; Zorio et al., 1994; Blumenthal and Gleason, 2003). Unlike the situation in prokaryotes, these C. elegans mRNAs are processed into monocistronic mRNAs by trans-splicing prior to translation. There are also reports of bi- or polycistronic transcripts in other eukaryotes, including tunicates, flagellate protozoa, Drosophila, mammals, filamentous fungi and plants, and several different transcript-processing strategies have been described. These various operon-like structures and associated processing mechanisms have probably arisen independently in different lineages. The small number of examples identified so far in the genomes of intensively studied organisms suggests that such gene organizations are relatively rare in eukaryotes (Hurst et al., 2004; Koonin, 2009; Osbourn and Field, 2009).
Prokaryotic gene clusters
Importantly, gene clustering in prokaryotes extends beyond the grouping of genes into operons. For example, the lacZYA operon is negatively regulated by the lacI repressor, which is encoded by the lacI gene (Figure 2). Although the lacI repressor gene is able to regulate expression of the lacZYA genes when placed anywhere in the E. coli chromosome, this gene is normally positioned immediately upstream of lacZYA. Co-localization of the lacI gene with the lacZYA operon may provide for the synthesis of the lacI repressor near the genes that it regulates, thereby facilitating the binding of the repressor to the lac operon operator sequence. Similar arguments have been proposed for the clustering of genes for transcription factors near the operons that they regulate, as this may enable the rapid binding of co-localized sites and tightly co-ordinated regulation (the ‘rapid search hypothesis’) (Price et al., 2005).
The genomes of bacteria such as the actinomycetes and myxobacteria contain numerous gene clusters for secondary metabolic pathways (Garcia et al., 2009; Hopwood, 2009a,b). Many of these compounds have useful biological properties, for example as antibiotics, immunosuppressants, anticancer agents, herbicides and insecticides. They presumably also have important ecological functions in nature (Osbourn, 2010a). In this context, a gene cluster is defined as a group of closely linked genes that are collectively responsible for the synthesis of a secondary metabolite (Osbourn, 2010a). These gene clusters are not single operons, but rather are clusters of operons and genes. The first complete biosynthetic gene cluster for an actinomycete antibiotic to be identified was the gene cluster for the blue-pigmented aromatic polyketide actinorhodin from Streptomyces coelicolor (Malpartida and Hopwood, 1984) (Figure 3a–c). The structure of this cluster is shown in Figure 3a. The 22 actinorhodin genes are organized as six (or possibly seven) operons. This is just one of numerous examples of gene clusters for secondary metabolic pathways in bacteria. Thus, although the pathway genes are not transcribed as a single operon there is evidently selection pressure for clustering. Actinomycetes have linear chromosomes (Lin et al., 1993). Although secondary metabolic gene clusters are occasionally found on large plasmids in actinomycetes (Mochizuki et al., 2003), in most cases they reside in the variable ‘arm’ regions close to the ends of the chromosome, whereas genes for essential functions such as DNA replication, transcription, translation and primary metabolism are generally found in the core of the chromosome (Challis and Hopwood, 2003). The location of gene clusters for synthesis of secondary metabolites close to the ends of chromosomes is a theme that will re-emerge later in this review.
Eukaryotic gene clusters
Eukaryotic genomes, like those of bacteria, also contain clusters of functionally related but non-homologous genes. It is important to emphasize that (by definition) these are not tandem arrays of duplicated genes, as the genes for the most part do not share sequence similarity. Conceptually it is perhaps easier to think of such clusters as ‘operon-like’, as the component genes are non-homologous but contribute to a common function, and the genes are physically linked and co-regulated (Osbourn and Field, 2009; Field and Osbourn, 2010). A critical difference, however, is that as far as we know the genes are all transcribed as separate mRNAs rather than as a single polycistronic mRNA (as is the case in bacteria). The known examples of non-homologous gene clusters in microbial eukaryotes and animals appear to be required for growth and survival under certain environmental conditions, and can therefore be regarded as adaptive gene clusters (Osbourn and Field, 2009; Osbourn, 2010a). In particular, they relate to the exploitation of new environments or the management of interactions with other organisms. Examples include clusters for catabolic pathways in yeast (Saccharomyces cerevisiae) that permit the use of new nitrogen or carbon sources, the yeast DAL and GAL clusters, respectively (Hittinger et al., 2004; Wong and Wolfe, 2005), and the MHC locus in animals, which is required for innate and adaptive immunity (interactions with pathogens in the environment) (Horton et al., 2004). Numerous examples of gene clusters have been discovered in filamentous fungi, where they are required for the synthesis of secondary metabolites (Hoffmeister and Keller, 2007; Turgeon and Bushley, 2010). 
Many of these fungal secondary metabolites have antibiotic activities, and so presumably suppress the growth of competing organisms in nature, whereas others are toxins that are important for the colonization of plant and animal hosts (for example, the Aspergillus flavus aflatoxin gene cluster) (Figure 3d–f).
One obvious assumption may be that these plant gene clusters have originated from microbes by horizontal gene transfer; however, there is very good evidence that this is unlikely to be the case. The genes and corresponding gene products for the first committed steps in these pathways can be regarded as signature genes/enzymes, as they are required for the synthesis of the skeleton structures of the different classes of secondary metabolite (Osbourn, 2010a). Considering these first, it is clear that the most likely origin of these genes is from plants rather than microbes. The signature genes all share homology with genes from plant primary metabolism, and so are likely to have been recruited directly or indirectly from primary metabolism by gene duplication and acquisition of new functions (as shown in Figure 4). Alternatively, it is possible that the primary metabolic genes and their signature gene counterparts evolved from a common ancestor.
The first committed step in the synthesis of the cyclic hydroxamic acids DIBOA and DIMBOA is the conversion of indole-3-glycerol phosphate to indole, which is catalysed by the tryptophan synthase α (TSA) homologue BX1. Bx1 is likely to have originated by duplication of the maize gene encoding TSA (Frey et al., 1997). TSA and BX1 are both chloroplast-localized indole-3-glycerol phosphate lyases. TSA forms a complex with the β-subunit of tryptophan synthase TSB to convert indole-3-glycerol phosphate to tryptophan, whereas BX1 functions as a monomer and produces free indole, which is then further metabolized to DIBOA and DIMBOA glucosides by other pathway enzymes (Gierl and Frey, 2001; Frey et al., 2009).
The first steps in the thalianol and avenacin pathways involve the cyclization of 2,3-oxidosqualene to distinct triterpene products, mediated by enzymes known as oxidosqualene cyclases (OSCs) (Haralampidis et al., 2001; Field and Osbourn, 2008). In A. thaliana 2,3-oxidosqualene is converted to thalianol by the OSC thalianol synthase (THAS), whereas in oat the same substrate is cyclized to the pentacyclic triterpene β-amyrin by the OSC β-amyrin synthase (SAD1 or AsBAS1). Thalianol and β-amyrin are then further modified by downstream pathway enzymes (Papadopoulou et al., 1999; Qi et al., 2006; Field and Osbourn, 2008; Mylona et al., 2008; Mugford et al., 2009; Qin et al., 2010). THAS and Sad1 are likely to have originated independently in the dicot and monocot lineages by duplication of cycloartenol synthase genes (Field and Osbourn, 2008). Cycloartenol synthase is an OSC that is required for primary metabolism, for the synthesis of essential sterols and steroid hormones.
The genes for the signature enzymes of the two diterpene clusters (for phytocassane and momilactone synthesis) are likely to have originated by recruitment from the gibberellin pathway (Sakamoto et al., 2004; Wilderman et al., 2004; Shimura et al., 2007; Swaminathan et al., 2009). The first step in these pathways involves cyclization of the diterpene precursor (E,E,E)-geranylgeranyl diphosphate (GGPP) by copalyl/labdadienyl diphosphate synthases (CPSs) to either ent-copalyl diphosphate (by OsCPS2 for phytocassane synthesis) or its isomer syn-copalyl diphosphate (by OsCPS4 for momilactone synthesis). These CPS enzymes are known as class-II diterpene synthases. A second set of enzymes (the kaurene synthase-like, KSL, enzymes, or class-I diterpene synthases) then carry out a further cyclization step, as shown in Figure 4 (Peters, 2006; Toyomasu, 2008). As for the other pathways, these intermediates are subsequently modified by downstream enzymes. The CPS and KSL enzymes of the phytocassane and momilactone pathways are believed to have been recruited from OsCPS1 and OsKS1, diterpene synthases that are required for the synthesis of gibberellins in rice (Swaminathan et al., 2009). The synthesis of sterols and triterpenes occurs via the mevalonate pathway, which is localized in the cytosol/endoplasmic reticulum (Chappell, 2002). In contrast, diterpene synthesis proceeds via a mevalonate-independent pathway, and takes place in the plastids (Eisenreich et al., 1998; Lichtenthaler, 1999; Rohmer, 1999).
The composition of the clusters
In addition to these signature genes the clusters also contain genes for the tailoring enzymes required for the further elaboration of the skeleton structures, including cytochrome P450s and other oxidoreductases, acyltransferases, methyltransferases and sugar transferases (Osbourn, 2010a). These enzymes, some but not all of which have been characterized, mediate the successive conversion of the products of the signature enzymes into the respective pathway end products (Figure 5). The gene content of the five plant clusters can be seen in Figure 1. Although these clusters consist primarily of genes that do not share sequence homology, and that encode different classes of biosynthetic enzymes, there is evidence of relatedness between cluster components in some cases. For example, the maize DIMBOA cluster contains genes for four closely related but highly substrate-specific cytochrome P450s that each catalyse a different step in the modification of indole to DIBOA (Frey et al., 1995, 1997). These CYP450s, which belong to the CYP71C subfamily, have almost certainly arisen from a common ancestor by tandem gene duplication and acquisition of new functions. Thus, there is a place for tandem gene duplication in the evolution of some components of these clusters, although importantly the clusters also contain other non-homologous genes for different classes of biosynthetic enzymes (Figure 1).
The avenacin and thalianol clusters are both required for triterpene synthesis, and contain genes for OSCs, CYP450s and acyltransferases (although the avenacin cluster contains genes for additional classes of biosynthetic enzyme as well). Thus, there are some superficial similarities between these two clusters. However, phylogenetic analysis indicates that the avenacin and thalianol clusters are monocot- and eudicot-specific, respectively, and that these clusters have evolved recently and independently in the two plant lineages (Field and Osbourn, 2008). This suggests that selection pressure may act during the formation of certain plant metabolic pathways to drive gene clustering, and that triterpene pathways are predisposed to such clustering. Similarly, the two rice diterpenoid biosynthetic gene clusters appear to have undergone independent evolution, although there could have been an ancestral cluster that contained common precursors to some of the current gene members. In this event substantial further assembly would still have been required to produce the momilactone and phytocassane clusters (Swaminathan et al., 2009).
The ability to produce cyclic hydroxamic acids is not restricted to maize. Within the Gramineae, Triticum spp. (wheat), Secale cereale (rye) and certain wild Hordeum (barley) species are able to synthesize these compounds, whereas oats, rice and cultivated barley are not (Niemeyer, 1988; Sicker et al., 2000). The Bx gene cluster is believed to be of ancient origin. Interestingly, the cluster is split in wheat and rye, but still appears to be functional. The splitting of the cluster has been attributed to a reciprocal translocation event in a common ancestor of the two cereals (Nomura et al., 2002, 2003). Some eudicots also produce cyclic hydroxamic acids, in particular DIBOA and its glucoside. These include certain members of the Ranunculales, such as Consolida orientalis (larkspur) and Lamium galeobdolon (yellow archangel), and Aphelandra squarrosa (zebra plant, which belongs to the order Lamiales). Comparative analysis of the BX1 enzymes of grasses and DIBOA-producing eudicots indicates that these enzymes do not share a common monophyletic origin, and it seems likely that the DIBOA/DIMBOA pathway has evolved independently in grasses and eudicots (Schullehner et al., 2008).
Gene clusters for the synthesis of secondary metabolites in bacteria and filamentous fungi often (but not always) contain genes for transporters that confer resistance to the pathway end product, and also for pathway-specific regulators (Osbourn, 2010a). Transporters for the five clustered plant pathways have not yet been identified, although Lr34, a recently cloned wheat gene that is associated with broad-spectrum disease resistance, is predicted to encode an ATP-binding cassette transporter implicated in the transport of defence compounds, and may conceivably form part of a secondary metabolic gene cluster for the synthesis of defence compounds (Krattinger et al., 2009; Osbourn, 2010b). So far the only transcription factor that has been identified for plant secondary metabolic gene clusters is OsTGAP1, a positive regulator of both momilactone and phytocassane synthesis in rice (Okada et al., 2009). This is a basic leucine-zipper transcription factor that induces diterpene synthesis in response to chitin oligosaccharide elicitor treatment. The gene for this transcription factor is not co-located with either of the clusters that it regulates. It is not known whether OsTGAP1 also regulates other secondary metabolic pathways in rice.
What is the Significance of Clustering?
Whereas clustering of genes for secondary metabolic pathways is more or less the norm in filamentous fungi, it is clearly not always the case in plants. Why should genes for some pathways be clustered, whereas those for other pathways (such as anthocyanin synthesis) are not? Two primary advantages can be envisaged: co-inheritance and co-regulation (Osbourn, 2010a,b). These are considered below.
Clustering of functionally related genes will facilitate the co-inheritance of favourable combinations of alleles at these multigene loci. Given that the complete cluster is required for the synthesis of the pathway end product, and that the products of the five plant gene clusters are all implicated in plant defence, the formation and maintenance of the most effective gene set is expected to be beneficial, and this may lead to selection for genomic rearrangements that reduce the distance between the pathway genes (Nei, 1967, 2003). Importantly, disruption of plant secondary metabolic gene clusters can lead not only to a loss of the protective pathway end product, but also to the accumulation of toxic/bioactive intermediates. For example, the accumulation of triterpene pathway intermediates in oat and A. thaliana can result in severe effects on growth and development (Field and Osbourn, 2008; Mylona et al., 2008; Osbourn, 2010b) (Figure 6). Thus, disruption of these pathways can have detrimental effects in addition to compromised pest and disease resistance, and this may further enhance selection for clustering.
Transcriptional co-regulation of genes can be influenced at different levels (Hurst et al., 2004; Sproul et al., 2005). Tandem duplicates may have similar expression patterns because they have similar promoters, whereas in some cases adjacent genes that are encoded on opposite DNA strands may be co-regulated by a common bidirectional promoter. Fusion of linked genes with related functions to make a single protein product represents a further extreme mechanism of achieving co-expression (e.g. Hawkins, 1987; Zhang and Smith, 1998; Gross et al., 2006). None of the above is sufficient to fully account for the genomic organization shown in Figure 1. However, a further benefit of gene clustering is that it may provide various other mechanisms for co-ordinated transcriptional regulation, including shared long-distance regulatory elements, or regulation through chromatin structure or nuclear organization (Hurst et al., 2004; Sproul et al., 2005; Janga et al., 2008; Osbourn and Field, 2009; Osbourn, 2010a,b). The grouping of genes into clusters may also facilitate the co-ordinated handling of transcripts that have arisen from physically linked genes, from transcription through processing and export to protein synthesis. In yeast, filamentous fungi and animals it is evident that gene cluster expression is associated with various histone modifications, and is mediated by chromatin remodelling complexes (reviewed in Osbourn and Field, 2009) (Figure 7). Non-coding RNAs have also been implicated in the recruitment of chromatin complexes, and in animals Hox gene expression can be controlled post-transcriptionally, and probably also epigenetically, by non-coding RNAs and Polycomb group proteins (Yekta et al., 2008; Fraser et al., 2009).
In filamentous fungi, regulation of secondary metabolic clusters at the level of chromatin is well established, and exploitation of drugs and mutations that affect chromatin remodelling is proving to be a very effective means of pathway discovery (e.g. Bok et al., 2009; Cichewicz, 2010). Intriguingly, histone-like proteins that modulate the transcription of contiguous sets of genes, including those for secondary metabolic pathways, have recently been discovered in the actinomycete Streptomyces coelicolor (McArthur and Bibb, 2006, 2007). Evidence is emerging from the thalianol and avenacin clusters, both of which are developmentally regulated, to implicate chromatin remodelling in the regulation of plant secondary metabolic clusters. The thalianol cluster has marked histone H3 lysine 27 (H3K27) trimethylation, suggestive of coordinated regulation at the chromatin level (Field and Osbourn, 2008) (Figure 7). In oat, high-resolution DNA fluorescence in situ hybridization (FISH) experiments have revealed that expression of the avenacin cluster is associated with chromatin decondensation (Wegel et al., 2009).
What are the Evolutionary Origins of Plant Secondary Metabolic Gene Clusters?
Horizontal gene transfer represents one potential source of gene clusters, and indeed has been proposed to be the primary source of such clusters in the ‘selfish operon/cluster’ model (Lawrence and Roth, 1996; Lawrence, 1999; Walton, 2000). It has recently been reported that the GAL cluster of Schizosaccharomyces yeasts is likely to have been acquired through horizontal gene transfer from a Candida yeast (Slot and Rokas, 2010), and there is evidence to suggest that several fungal gene clusters associated with catabolic and secondary metabolic pathways have also spread by horizontal transmission (Patron et al., 2007; Slot and Hibbett, 2007; Khaldi et al., 2008). However, the genomes of microbes and plants are highly dynamic, and there is a growing body of evidence to indicate that clusters can form de novo by recruitment and diversification of existing genetic components. Various models for this can be envisaged. The same paper that reports the horizontal gene transfer of a GAL cluster between two different yeasts also draws on the extensive body of genome sequence information available for different yeasts to show that multiple GAL pathway gene clusters have evolved independently and by different mechanisms. Whereas the origin of the Schizosaccharomyces GAL cluster is likely to have been through horizontal gene transfer from Candida, the cluster that is common to Saccharomyces and Candida appears to have formed through the relocation of native unclustered genes. In contrast, the GAL cluster of Cryptococcus yeasts seems to have been assembled independently from Saccharomyces/Candida and Schizosaccharomyces GAL clusters, and co-exists in the Cryptococcus genome with unclustered GAL paralogues (Slot and Rokas, 2010). Similarly, the DAL cluster in Saccharomyces cerevisiae is also likely to have arisen by genomic rearrangements (Wong and Wolfe, 2005). Gene duplications are regarded as one of the central mechanisms for the origin of new genes (Ober, 2010). 
Where single-copy genes exist within metabolic gene clusters, and appear to have relocated to their new positions in the cluster, the underlying events may have involved gene duplication followed by the loss of the gene at the original locus. In other cases the presumed progenitor genes can be observed in their original genomic locations. In filamentous fungi there is also evidence that existing secondary metabolic clusters can be augmented by the recruitment of additional genes that are sourced from elsewhere in the genome (Proctor et al., 2009). The grouping of clusters together in ‘superclusters’ in both prokaryotes and eukaryotes may provide a framework for linking different multi-step processes together (Trefzer et al., 2002; Ehrlich et al., 2005; Perrin et al., 2007; Fischbach, 2009; Osbourn, 2010a).
For the plant secondary metabolic gene clusters described here, there is good evidence that the genes within the clusters are of plant rather than microbial origin. Duplication of whole genomes or of segments of genomes represents a potential mechanism of cluster duplication. However, this does not explain how clusters originate de novo. It seems likely that the genes for the signature enzymes for the five pathways have been recruited directly or indirectly from primary metabolism by gene duplication and acquisition of new functions (Figure 4), or alternatively that the signature genes and their primary metabolism counterparts arose from a common ancestral gene. Could it be that genes for primary metabolism are clustered in plants, and that such clusters have provided a template for new divergent secondary metabolic clusters by cluster duplication? Computational analysis of the Arabidopsis thaliana genome has suggested that genes for some primary metabolic pathways are more tightly linked than would be expected by chance, but that most of these clusters are very loose, in some cases spanning nearly one-quarter of a chromosome (Lee and Sonnhammer, 2003). Alternatively, once a gene for a signature enzyme has arisen by duplication and functional divergence, this gene may then ‘seed’ the formation of a secondary metabolic cluster. The genes that are recruited to this cluster could be relocated from elsewhere in the genome, as appears to occur in some cases in the yeast DAL and GAL clusters (Wong and Wolfe, 2005; Slot and Rokas, 2010). It is also possible that the other cluster components arise by duplication of genes elsewhere in the genome, followed by the acquisition of new functions, as for the signature genes. Thus, the raw material for gene clusters may be garnered from around the genome, as appears to be the case for the DAL gene cluster (Wong and Wolfe, 2005). 
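The question of whether the genes for a pathway are 'more tightly linked than would be expected by chance' can be illustrated with a simple permutation test. The sketch below is not the method of Lee and Sonnhammer (2003); it is a minimal example, assuming a single chromosome on which genes are represented by their rank order, and the function names and example coordinates are hypothetical.

```python
import random


def span(positions):
    """Distance between the outermost genes of a set (in gene-order units)."""
    return max(positions) - min(positions)


def linkage_p_value(pathway_positions, all_positions, n_perm=10_000, seed=0):
    """Estimate how often a random gene set is at least as compact as the pathway.

    Assumption (illustrative only): every gene on the chromosome is equally
    likely to be drawn, and compactness is measured by the span of the set.
    """
    rng = random.Random(seed)
    observed = span(pathway_positions)
    k = len(pathway_positions)
    hits = sum(
        span(rng.sample(all_positions, k)) <= observed
        for _ in range(n_perm)
    )
    # Add-one correction so that the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical chromosome with 1000 genes, numbered by their order.
chromosome = list(range(1000))
tight_set = [100, 101, 103, 104]   # four pathway genes within five positions
loose_set = [0, 300, 600, 999]     # four pathway genes spread along the chromosome

print(linkage_p_value(tight_set, chromosome))
print(linkage_p_value(loose_set, chromosome))
```

In this toy example the tightly grouped set yields a very small value, whereas the dispersed set does not differ from random expectation. A 'loose' cluster of the kind noted above, spanning a large fraction of a chromosome, would fall somewhere between these two extremes.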
In bacteria it is evident that operons can form through rearrangements that bring more distant genes into proximity. In some instances this may involve compaction of regions containing genes that are functionally relevant to the cluster by deletion of the intervening irrelevant genes (Price et al., 2006), a process that we have previously termed ‘genome defragmentation’ (Osbourn, 2010a). This could represent another mechanism by which genes are brought together in plants, although there is at least some evidence that runs counter to this argument. For example, the presumed progenitors of the genes for the first two steps in the avenacin pathway are the sterol biosynthesis genes cycloartenol synthase (an oxidosqualene cyclase) and obtusifoliol 14α-demethylase (a cytochrome P450 belonging to the CYP51 family) (Haralampidis et al., 2001; Qi et al., 2004, 2006). These two sterol biosynthesis genes are not linked to each other (or to the avenacin cluster) in oat (Qi et al., 2004, 2006), and there is no evidence for linkage of these highly conserved cycloartenol synthase and obtusifoliol 14α-demethylase genes in rice or in Arabidopsis thaliana.
As discussed earlier, the selection for formation of plant secondary metabolic clusters is presumably intense, driven by the need to rapidly adapt to environmental change. Like the actinomycete clusters, fungal secondary metabolic gene clusters are often found close to the ends of chromosomes (Hoffmeister and Keller, 2007). The subtelomeric regions of eukaryotic chromosomes are highly dynamic, facilitating non-allelic recombination, DNA inversion, partial deletions, translocations and other rearrangements, and also enabling conditional expression of genes through subtelomeric gene silencing mechanisms. Relocation of genes into subtelomeric regions may therefore enable rapid gene divergence because of local elevated genome dynamics (Brown et al., 2010). The maize DIMBOA cluster is located close to the end of chromosome 4 (Figure 8a). However, the Arabidopsis thaliana thalianol cluster and the two rice clusters are not in subtelomeric regions. The oat genome has not been sequenced, but the avenacin gene cluster maps to the end of a linkage group (Qi et al., 2004), and DNA FISH suggests that this cluster may also be subtelomeric (Figure 8b). Repetitive DNA sequences and transposable elements are often associated with secondary metabolic clusters in filamentous fungi (Perrin et al., 2007; Shaaban et al., 2010), and may have functions in cluster formation and regulation in plants.
The first plant secondary metabolic gene cluster to be discovered was the maize DIMBOA cluster, over 10 years ago (Frey et al., 1997). Since then four more clusters have been reported, and secondary metabolic clusters are now an emerging theme in plant biology. The examples so far are for cyclic hydroxamic acids, triterpenes and diterpenes, all of which have distinct biogenic origins. Other putative terpene clusters, for example in Arabidopsis thaliana and in rice, have been noted (Aubourg et al., 2002; Sakamoto et al., 2004; Ehlting et al., 2008; Field and Osbourn, 2008). The signature genes for the five pathways reported so far share common evolutionary histories with genes from primary metabolic pathways involved in the synthesis of, amongst other things, plant growth hormones (Figure 4). Might this be a trend? As the volume of available genome sequence information for different plant species continues to grow, it will be possible to determine how common secondary metabolic clusters are in plant genomes, and what kinds of compounds these clusters produce. Other major challenges are to investigate how these clusters form, both at the level of the individual organism and in populations, and to understand the significance of clustering. Improved knowledge of secondary metabolic gene clusters has practical applications, both for natural product discovery and for the development of plants with improved/novel metabolic traits for agricultural and industrial uses.
Anne Osbourn acknowledges the Biotechnology and Biological Sciences Research Council (BBSRC), the Engineering and Physical Sciences Research Council, the European Union and the Royal Society for funding. Hoi Yee Chu is supported by a BBSRC Doctoral Training Grant. We would like to thank David Hopwood, Cathie Martin, Paul O’Maille and Stephan Bornemann for the useful discussions that we have had during the preparation of this article, and other colleagues for providing valuable comments and information. The image shown in Figure 8b was generated by Eva Wegel as part of work funded by Biotechnology and Biological Sciences Research Council grant no. BB/C504435/1, a joint award to Anne Osbourn and Peter Shaw (John Innes Centre).