Tandem duplication of the FLC locus and the origin of a new gene in Arabidopsis related species and their functional implications in allopolyploids


  • Gyoungju Nah,

    1. Section of Molecular Cell and Developmental Biology
  • Z. Jeffrey Chen

    1. Section of Molecular Cell and Developmental Biology
    2. Institute for Cellular and Molecular Biology
    3. Section of Integrative Biology
    4. Center for Computational Biology and Bioinformatics, The University of Texas at Austin, Austin, TX 78712, USA

Author for correspondence:
Z. Jeffrey Chen
Tel: +1 512 4759327
Email: zjchen@mail.utexas.edu


  • Flowering time is an important adaptive trait and varies among Arabidopsis thaliana and its related species, including allopolyploids that are formed between A. thaliana and Arabidopsis arenosa. FLOWERING LOCUS C (FLC) inhibits early flowering in A. thaliana. A previous study showed that late-flowering A. arenosa contained two or more FLC alleles that were differentially expressed in Arabidopsis allotetraploids, but the genomic organization and evolution of the FLC locus were unknown.
  • Comparative sequence and evolutionary analyses were performed in FLC-containing genomic regions in A. thaliana, A. arenosa and Arabidopsis lyrata, and expression of FLC loci and alleles was examined in Arabidopsis allopolyploids.
  • The FLC locus was tandemly duplicated in A. lyrata and triplicated in A. arenosa, and the tandem duplication event occurred after divergence from A. thaliana. Although FLC duplicates were highly conserved, their upstream sequences rapidly diverged. The third FLC copy in A. arenosa acquired a new splicing site through a point mutation in the intron and generated the new exon followed by an early stop codon, resulting in a novel MADS box gene.
  • Flowering time variation in Arabidopsis allopolyploids is probably related to the expression diversity and/or copy number of multiple FLC loci. Moreover, exonization of intronic sequence is a mechanism for the origin of new genes.


Flowering time is an important adaptive trait. Plants respond to external growth conditions such as seasonal changes to regulate flowering time (Simpson & Dean, 2002). In Arabidopsis, there is natural variation in flowering time among Arabidopsis thaliana and its closely related species, Arabidopsis arenosa and Arabidopsis lyrata, which diverged from A. thaliana c. 5 million yr ago (Mya) (Koch et al., 2000). Arabidopsis arenosa is a short-lived perennial and grows from central to northern Europe (Clauss & Koch, 2006). Arabidopsis lyrata is a perennial and occupies niches from sea level to 1.5 km of elevation in cold to mild climatic regions of the Northern Hemisphere. Flowering in Arabidopsis is affected by environmental cues such as cold temperature during winter and increased temperature and daylength during spring and summer (Simpson & Dean, 2002). The flowering time of A. lyrata populations ranges from 75 to 175 d without vernalization (prolonged cold treatment) (Riihimaki et al., 2005). Arabidopsis arenosa requires c. 60 d to flower. Both species are late flowering compared with A. thaliana, which usually flowers within c. 30 d after germination. This suggests that flowering time is adaptive to environmental cues and ecological niches.

Flowering variation is controlled by several mechanisms, including the autonomous pathway that involves FLOWERING LOCUS C (FLC) and FRIGIDA (FRI) (Simpson & Dean, 2002). FLC encodes a MADS box transcription factor that represses transition to flowering in A. thaliana (Michaels & Amasino, 1999). FLC is controlled by its positive regulator FRI (Johanson et al., 2000) and the negative regulator LUMINIDEPENDENS (LD) (Lee et al., 1994). FLC expression is also affected by environmental cues, such as vernalization that promotes flowering by repressing FLC (Michaels & Amasino, 1999). Among A. thaliana ecotypes, strong and weak alleles of FLC are associated with different flowering times because of the insertion of a transposon in the first intron (Gazzani et al., 2003; Michaels et al., 2003). No null allele or copy-number variation of FLC has been identified in A. thaliana ecotypes. While many A. thaliana ecotypes including Columbia (Col) and Ler do not contain functional FRI, A. lyrata and A. arenosa have a functional FRI (Wang et al., 2006a; Kuittinen et al., 2008). The synthetic Arabidopsis allotetraploids contain two sets of FLC originating from A. thaliana and A. arenosa, respectively, and flower late. Inhibition of early flowering is caused by upregulation of A. thaliana FLC (AtFLC) that is trans-activated by A. arenosa FRI (AaFRI). Moreover, two duplicate FLCs (AaFLC1 and AaFLC2) originating from A. arenosa are expressed in some allotetraploids but silenced in other lines. The strong AtFLC and AaFLC loci are maintained in natural Arabidopsis allotetraploids, leading to extremely late flowering. FLC expression variation in allotetraploids is controlled by histone H3-Lys4 methylation and H3-Lys9 modifications. The multiple FLC transcripts of A. arenosa (Wang et al., 2006a) are derived from two FLC genomic locations within a Bacterial Artificial Chromosome (BAC) (Ha et al., 2009), indicating that the FLC locus might have been tandemly duplicated in the A. arenosa genome.

In this study, the duplication pattern and copy number variation of the FLC locus were further investigated in A. thaliana and its related species, A. arenosa and A. lyrata. The FLC genomic regions were comparatively analysed between A. thaliana, A. arenosa and A. lyrata, whose genome has been recently sequenced. We found tandem duplication events of FLC in A. arenosa and A. lyrata, analysed gene structures and sequence evolution of FLC orthologues, and determined genomic features of regulatory elements in FLC duplicates. A novel MADS box gene (named AaFLC3) was discovered in the tandemly duplicated region of FLC in A. arenosa. These genomic features will facilitate functional analysis of the novel MADS box gene and FLC duplicates in flowering time diversity among Arabidopsis related species and in allotetraploids.

Materials and Methods

Plant materials

Arabidopsis allotetraploids (Allo733) were resynthesized between A. thaliana autotetraploid (Ler, accession no. CS3900) and A. arenosa tetraploid (accession no. CS3901) (Wang et al., 2006b). Seeds from the allotetraploids and parental lines as well as A. thaliana autotetraploid (Col, accession no. 1757) were germinated in Murashige and Skoog basal salt (Murashige & Skoog, 1962) including vitamins, 3% (w : v) sucrose and 0.8% (w : v) agar at 21°C under long-day conditions (16 h light and 8 h dark). After 2 wk, seedlings were transferred to a pot containing vermiculite–soil mix and grown under the same long-day conditions. Leaves were harvested from A. thaliana (Col and Ler), A. arenosa and Allo733 at 4 wk and 8 wk, respectively, for RNA isolation.

Sequence analysis

The FLC-containing BAC (146 404 bp) in A. arenosa was sequenced (GenBank accession no. FJ461780) (Ha et al., 2009). The genomic regions of 122 864 bp and 170 441 bp that correspond to the A. arenosa BAC sequence were parsed from A. thaliana (Col) (NCBI, Bethesda, MD, USA) and A. lyrata (DOE Joint Genome Institute, JGI, Walnut Creek, CA, USA), respectively. Sequence alignment to the A. thaliana (Col) reference genome was made using the VISTA genome browser (http://genome.lbl.gov/vista/index.shtml). The GC-content was measured using the EMBOSS program geecee (http://emboss.bioinformatics.nl/cgi-bin/emboss/geecee). Dot-plot comparison was analysed using zpicture (http://zpicture.dcode.org/) and dotter (Sonnhammer & Durbin, 1995). The genomic sequences were annotated using fgenesh (http://www.softberry.com), a gene prediction software program for A. thaliana genome annotation (Swarbreck et al., 2008). The predicted genes were proofed by manual annotation editing using the expressed sequence tag (EST) information from the A. thaliana RefSeq RNA database and the nonredundant protein database from NCBI (http://blast.ncbi.nlm.nih.gov/Blast.cgi). Repetitive elements were searched against the plant repeat database from the Genetic Information Research Institute (GIRI, http://www.girinst.org/repbase/update/index.html). Annotation of both gene and repetitive elements was displayed and analysed using Apollo Genome Annotation Curation Tool version 1.11.1 (http://apollo.berkeleybop.org/current/index.html). The sequence analysis was aided by blastn, blastp, tblastn, blastx and 2-blast (http://www.ncbi.nlm.nih.gov/BLAST/). The construction of phylogenetic trees and test of Ka : Ks values were performed using mega4 (http://www.megasoftware.net/). The Nei–Gojobori method with the Jukes–Cantor model (Jukes & Cantor, 1969) was used for testing Ka : Ks values, and a neighbor joining method (Saitou & Nei, 1987) with 1000 bootstrap replicates was used for constructing respective phylogenetic trees.
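
Two of the quantities named above, GC content and the Jukes–Cantor correction that underlies Ka : Ks ratios, can be illustrated with a minimal Python sketch. This is not the code of geecee or mega4; the function names are hypothetical, and the proportions of nonsynonymous and synonymous differences (p_n, p_s) are assumed to have been counted already by a Nei–Gojobori-style procedure.

```python
import math


def gc_content(seq: str) -> float:
    """Fraction of G and C among the unambiguous A/C/G/T bases of seq."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return (seq.count("G") + seq.count("C")) / counted


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor distance d = -(3/4) * ln(1 - 4p/3) for a proportion p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def ka_ks(p_n: float, p_s: float) -> float:
    """Ratio of corrected nonsynonymous to synonymous distances.

    p_n and p_s are hypothetical inputs: the observed proportions of
    nonsynonymous and synonymous differences between two coding sequences.
    """
    ks = jukes_cantor(p_s)
    if ks == 0.0:
        raise ValueError("Ks is zero; the ratio is undefined")
    return jukes_cantor(p_n) / ks
```

Under this sketch, a ratio below 1 is read as evidence of purifying selection, which is how the Ka : Ks values are interpreted in this study.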

Expression analysis

Total RNA was extracted from rosette leaves of A. thaliana (Col and Ler), A. arenosa and Allo733 using Plant RNA Purification Reagent (Invitrogen) followed by DNase (Promega) treatment. Expression of FLC loci and alleles was analysed using 1 μg of total RNA in a reverse transcriptase polymerase chain reaction (RT-PCR) assay containing SuperScript III/Platinum Taq mix (Invitrogen) and the primer pair. The primer pair for AaFLC1 and AaFLC2 amplification and restriction assays was: AaFLC_RT1_for CAGCTTCTCCTCCGGCGATAACCTGGT and AaFLC_RT1_rev GGAGAGTCACCGGAAGATTGTCGGAGA; the primer pair for AaFLC3 expression assays was forward, 5′-ATGGGGAGTAAAAAACTAGAAATCAAG-3′, and reverse, 5′-CGTAAGTGCATTGCCTTATCGCCG-3′; and the primer pair for ACTIN expression control was forward, 5′-TTGACCTTGCTGGACGTGACCTTA-3′, and reverse, 5′-ACAATTTCCCGCTCTGCTGTTGTG-3′. The reverse transcription reaction was performed at 50°C for 30 min followed by 94°C for 2 min and 35 cycles of PCR reactions (94°C for 15 s, 55°C for 30 s, and 68°C for 1 min with the last extension of 68°C for 5 min).


Comparative sequence analysis and detection of tandem duplication at the FLC locus

Microcolinearity and sequence organization of the FLC-containing genomic regions were analysed in A. arenosa, A. lyrata and the A. thaliana (Col) reference genome using dotter, zpicture, and vista. The overall colinearity was well maintained in all three species (Fig. 1a, see the Supporting Information, Fig. S1). The total GC-content in the regions was 35% in A. thaliana and 36% in both A. arenosa and A. lyrata, which is slightly higher than the 34.5% overall GC content of chromosome 5 of A. thaliana (Tabata et al., 2000). Despite the overall microcolinearity, distinct local rearrangements were present in A. arenosa and A. lyrata (Fig. 1a–c), showing typical tandem duplication patterns. These duplicate regions corresponded to the FLC locus in A. thaliana, indicating that more than one copy of FLC is present in A. arenosa and A. lyrata. To investigate the copy number of FLC in this region, the genomic region of A. thaliana FLC (AtFLC) was aligned with AtFLC_NCBI (122 864 bp), AaFLC_NCBI (146 404 bp) and AlFLC_JGI (170 441 bp), the sequences of A. thaliana, A. arenosa and A. lyrata, respectively. The dot plots showed two intact FLC copies in A. lyrata and two intact copies and one partial copy in A. arenosa, whereas only one copy of FLC was present in A. thaliana (Fig. 1d–f).

Figure 1.

 Genomic sequence analysis using dot-plot alignment. (a–c) Genomic sequence comparison in the vicinity of FLOWERING LOCUS C (FLC) locus between (a) Arabidopsis thaliana (AtFLC_NCBI) and Arabidopsis arenosa (AaFLC_NDBI), (b) A. thaliana and Arabidopsis lyrata (AlFLC_JGI), (c) A. arenosa and A. lyrata. The duplicate regions are shown in boxes. (d–f) Copy number identification of the FLC locus in three related species. The genomic sequence containing 5.6 kb of AtFLC (At5g10140) was aligned against (d) AtFLC_NCBI, (e) AlFLC_JGI, (f) AaFLC_NDBI.
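
The idea behind reading copy number from a dot plot can be sketched in a few lines of Python. This is a simplified exact-match illustration, not the algorithm used by dotter or zpicture; the function names, the word size k and the min_fraction threshold are all assumptions made for this example. Each copy of the query in the target shows up as a diagonal (a constant offset j − i) supported by many shared k-mers.

```python
from collections import defaultdict


def kmer_matches(query: str, target: str, k: int = 8):
    """Return (i, j) pairs where the k-mer starting at query[i] equals the one at target[j]."""
    index = defaultdict(list)
    for j in range(len(target) - k + 1):
        index[target[j:j + k]].append(j)
    return [(i, j)
            for i in range(len(query) - k + 1)
            for j in index.get(query[i:i + k], [])]


def count_copies(query: str, target: str, k: int = 8, min_fraction: float = 0.5) -> int:
    """Count dot-plot diagonals supported by at least min_fraction of the query k-mers.

    Each such diagonal is taken as one copy of the query in the target.
    Exact matching only, so diverged or partial copies may be missed.
    """
    per_diagonal = defaultdict(int)
    for i, j in kmer_matches(query, target, k):
        per_diagonal[j - i] += 1
    threshold = min_fraction * (len(query) - k + 1)
    return sum(1 for hits in per_diagonal.values() if hits >= threshold)


# Toy sequences invented for illustration; they are not FLC sequences.
toy_query = "ACGTTGCAAGCTTAGGCTAACCGGTATCGATTGCACGTAC"
toy_single = "TTTTTTTTTT" + toy_query + "CCCCCCCCCC"
toy_tandem = "TTTTTTTTTT" + toy_query + "GGGGGGGGGGGG" + toy_query + "CCCCCCCCCC"
```

With these toy inputs, the query is found once in toy_single and twice in toy_tandem. Real comparisons would need inexact matching, which is what dedicated dot-plot tools provide.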

Annotation, microsynteny and local disruption

Comparative analysis of the gene order, orientation, and sequence homology indicated that the overall synteny in the vicinity of the FLC locus was highly maintained. Several local rearrangements, including tandem duplication, gene fragmentation and insertion of noncolinear genes and transposable elements were identified between A. thaliana and A. arenosa and between A. arenosa and A. lyrata (Koch et al., 2000). Thirty-six putative genes including FLC were identified in A. arenosa and A. lyrata (Fig. 2a, Table S1). The average gene size (ATG to stop codon) was 1912 bp in A. arenosa and 1928 bp in A. lyrata, which is larger than the average gene size (1803 bp) in this region in A. thaliana. Two intact copies of FLC in A. lyrata and A. arenosa were annotated as ORF26-1 and ORF26-2 (Figs 1e,f, 2a). Based on mRNA sequences of two A. arenosa FLC transcripts (Wang et al., 2006a), ORF26-1 and ORF26-2 corresponded to AaFLC2 and AaFLC1, respectively. In addition, A. arenosa contained a partial copy of FLC (ORF26-3) in the tandem duplicate array. This truncated FLC-like gene was named AaFLC3. In A. lyrata, ORF26-1 and ORF26-2 were designated AlFLC2 and AlFLC1, respectively.

Figure 2.

 Genomic organization and tandem duplication patterns in the vicinity of FLOWERING LOCUS C (FLC) loci. (a) Graphic display of gene annotation. FLC tandem duplicates were found in Arabidopsis arenosa (AaFLC1, AaFLC2 and AaFLC3) and in Arabidopsis lyrata (AlFLC1 and AlFLC2). Blocks A and B indicate segmental tandem duplication of three adjacent genes (ORF25, ORF26 and ORF27). Colinear genes (blue), noncolinear genes (pink), transposable elements (green), FLC and its duplicates (red), FLC neighboring duplicates (black) and gene fragments (black dashed lines) are shown. Sequential numbers indicate annotated genes (blue) except for transposable elements (green). (b) Origin of AaFLC3 gene structure. Boxes indicate tandemly duplicated regions sharing sequence homology. A new splicing site (AG) and a stop codon (TAA) are shown in the second exon of AaFLC3. Dashed lines indicate intergenic regions. Arrows indicate the gene orientation.

The FLC locus was duplicated in both A. arenosa and A. lyrata, but the duplication patterns were different. In A. arenosa, the duplication was limited to the FLC locus, excluding neighboring genes. In A. lyrata, the duplication involved FLC and two adjacent genes (arrowed A and B regions in Fig. 2a), which were annotated to be ORF25 (a putative homolog of At5g10130, pollen extensin family gene), ORF26 (a homolog of At5g10140, FLC) and ORF27 (a predicted homolog of At5g10150, unknown) (Table S1). One copy of ORF25 (segment A, Fig. 2a) could not be annotated, whereas the other copy (segment B in Fig. 2a) contained a complete open reading frame (ORF). The gene fragments in the first ORF25 were identified by the alignment of 94/173 amino acids (54% identity) (E-value ≤ 3E−40), which involved two insertions/deletions and seven premature stop codons in the middle of the alignment. The data suggest that one copy of ORF25 was pseudogenized after tandem duplication in A. lyrata. Duplicates of ORF26 (ORF26-1 (AlFLC2) and ORF26-2 (AlFLC1)) and ORF27 were fully annotated.

In A. arenosa, three gene fragments with sequence identity over 90% (E-value ≤ E−20) were identified between ORF26-1 (AaFLC2) and ORF26-2 (AaFLC1). No trace fragment of ORF27 was found, suggesting that the ORF27 duplicates have been lost. Alternatively, the duplication may not have included the ORF27 region. The duplication pattern in the vicinity of AaFLC3 was different from that of the two intact FLC copies, and only a partial sequence of the adjacent AaFLC1 was duplicated (Fig. 2b). Exon 1 and part of intron 1 of AaFLC1 were duplicated in the 5′ upstream region, which provided two exons and one intron for the novel AaFLC3. Exon 1 encoded a MADS box protein in AaFLC3 and was conserved among the three FLC loci. However, exon 2 in AaFLC3 was unique because it originated from the intronic sequence (Fig. 2b). AaFLC3 was smaller than the full-length FLC because a new stop codon (TAA) was created after exon 2. The remaining intronic sequence became an intergenic region between AaFLC1 and AaFLC3.

In addition to FLC regions, ORF12 (At5g10000, electron carrier, ATFD4) was absent in A. lyrata. The tblastn analysis showed a short stretch of alignment with 16–21 amino acids (76% identity) in A. arenosa. A noncolinear gene was present in A. lyrata between ORF22 and ORF23, which was homologous to At3g47680 (glutathione transferase) in chromosome 3 of A. thaliana. Non-LTR retroelements were present in A. arenosa and A. lyrata but not in A. thaliana. A LINE retroelement (5295 bp) was found between ORF30 and ORF31 in A. arenosa. This LINE element was homologous to At5g18880 (E-value ≤ 1E−81) in A. thaliana. The LINE was located in the 5′ upstream region (c. 340 bp) from the translation start site (ATG) of ORF31 (homologous to At5g10190, carbohydrate transmembrane transporter), suggesting that this LINE insertion may affect expression of ORF31. Another non-LTR retroelement (5600 bp) was located between ORF06 and ORF07 in A. lyrata. The element had sequence similarity (E-value ≤ 1E−102) to At4g29090 in A. thaliana. Some of these retroelements were inserted after the separation between A. arenosa and A. lyrata.

Comparative analysis of homologous FLCs and their evolutionary relationships

As shown in Fig. 3(a), the exon number and size were the same in all FLC homologues, except for AaFLC3. The number of introns remained the same, but the size varied, especially in intron 1. The sequences of exons and introns except for intron 1 were highly conserved among different homologues (Fig. S2). The size of genes varied (Fig. 3a), whereas the number of amino acids (aa) for all FLC homologues remained the same (190 aa), indicating variable intron lengths associated with gene size variation. Amino acid sequences among FLC homologues except for AaFLC3 were well conserved, and the sequence identity ranged from 92% to 97%.

Figure 3.

 Analyses of gene structure and evolutionary relationship among FLOWERING LOCUS C (FLC) homologues. (a) Gene structure comparison among FLC homologues (exons, closed boxes; introns, solid lines). The size of exons is invariable and indicated above AtFLC, whereas the size of introns is variable and indicated in each gene. (b) Phylogenetic tree constructed using amino acid sequence of FLC homologues. The neighbor joining method was used to construct the tree, and bootstrap values (%) from 1000 replicates are indicated. The tree was rooted to Brassica napus FLC1 (BnFLC1). (c) Occurrence of tandem duplication in three Arabidopsis species. The arrow indicates when the tandem segmental duplication might have occurred.

A phylogenetic tree was constructed to infer evolutionary relationships among FLC homologues. AlFLC2 was in the same clade with AaFLC2 rather than with AlFLC1, and AlFLC1 was most diverged from AaFLC1 (Fig. 3b), implying that tandem duplication might have occurred before the speciation of A. arenosa and A. lyrata but after the split between the A. thaliana and A. arenosa/A. lyrata lineages (Fig. 3c). The divergence time of tandem duplicates was estimated using the synonymous substitution rate (k), which is 6.2 × 10⁻⁹ for A. arenosa (Jakobsson et al., 2006) and 1.5 × 10⁻⁸ for A. lyrata (DeRose-Wilson & Gaut, 2007). The estimates indicated that AlFLC2 was duplicated c. 2.5 Mya in A. lyrata and AaFLC2 was duplicated c. 2.4 Mya in A. arenosa, before the speciation event between A. lyrata and A. arenosa c. 2 Mya (Koch et al., 2000). Another tandem duplication that created AaFLC3 occurred c. 0.18 Mya only in A. arenosa after the split between A. arenosa and A. lyrata (Fig. 2a).
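
The dating step can be illustrated with a short Python sketch. It assumes the commonly used molecular-clock relation T = Ks / (2k) and a rate k in substitutions per synonymous site per year; the text does not state the formula or the Ks values of the paralogous pairs that were actually used, so the function name and the inputs below are illustrative assumptions only.

```python
def divergence_time_mya(ks: float, rate_per_site_per_year: float) -> float:
    """Time since two copies diverged, in million years, assuming T = Ks / (2k).

    ks: synonymous substitutions per synonymous site between the two copies.
    rate_per_site_per_year: synonymous substitution rate k (assumed units).
    The factor 2 accounts for substitutions accumulating along both lineages.
    """
    if ks < 0 or rate_per_site_per_year <= 0:
        raise ValueError("ks must be non-negative and the rate must be positive")
    return ks / (2.0 * rate_per_site_per_year) / 1e6


# Illustrative inputs only: the rates are those quoted in the text,
# the Ks values are hypothetical and chosen for this example.
t_arenosa = divergence_time_mya(0.03, 6.2e-9)
t_lyrata = divergence_time_mya(0.075, 1.5e-8)
```

Because the estimate scales inversely with k, the same Ks gives a more recent date under the faster A. lyrata rate than under the A. arenosa rate.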

The values of synonymous (Ks) and nonsynonymous (Ka) mutations were used to test sequence evolution of FLC duplicates. The Ka : Ks values were < 1 in all tests (0.25 for AtFLC and AaFLC1; 0.23 for AtFLC and AlFLC1; 0.44 for AtFLC and AaFLC2; and 0.27 for AtFLC and AlFLC2). These values are higher than 0.213, the value obtained from the analysis of 304 orthologous loci between A. thaliana and A. lyrata (Barrier et al., 2003), but remain well below 1, suggesting purifying selection for FLC loci. Among orthologous and paralogous FLC loci in A. arenosa and A. lyrata, Ks values were 0.03 for AlFLC1 vs AaFLC1 and AlFLC1 vs AaFLC2, 0.06 for AlFLC2 vs AaFLC1, and 0.08 for AlFLC2 vs AaFLC2. AlFLC2 and AaFLC2 might have diverged at a faster rate than other loci.

Despite protein sequence conservation, the putative promoter regions (1.5–2 kb) diverged rapidly among AaFLC1, AaFLC2 and AaFLC3, except for the core promoter near the translation start site. This region contained a c. 340-bp block including c. 109-bp of 5′ UTR from AtFLC and all other homologues (Fig. 4). AaFLC1 carried a duplicate 340-bp block in the 5′ upstream region. In addition, the first intron of FLC is long (several kb) and is known to contain regulatory sequences for the FLC expression in response to vernalization (Sheldon et al., 2002). All FLC homologues, except for AaFLC3, contained a c. 1000-bp block. Interestingly, this block was found in both the 5′ upstream region and the first intron of AaFLC1 (Fig. 4). The data suggest that AaFLC1 has more regulatory sequence changes than other duplicate alleles. Consistent with the previous results (Wang et al., 2006a), the analysis of RT-PCR coupled with cleaved amplified polymorphic sequence (CAPS) showed that both AtFLC and AaFLC1 expression was well maintained in the resynthesized allotetraploid, whereas AaFLC2 was repressed in the allotetraploid (Fig. 5c). The differential expression of AaFLC1 and AaFLC2 is probably related to the upstream regulatory regions that are divergent between AaFLC1 and AaFLC2.

Figure 4.

 Comparison of putative regulatory elements among FLOWERING LOCUS C (FLC) homologues. Patterns of putative regulatory sequences, 5′ upstream sequence and the first intron are shown. Filled, dark-tinted, light-tinted and unfilled boxes indicate exon, c. 1000-bp block containing intron 1 sequence, c. 340-bp core promoter sequence, and c. 500-bp indel in intron 1, respectively. Solid and dashed lines indicate introns and unclassified 5′ upstream sequences, respectively. ATG indicates translational start codon.

Figure 5.

 Comparative analyses of AaFLC3 and other FLOWERING LOCUS C (FLC) homologues. (a) Comparison of AaFLC3 gene structure with other FLC duplicates. Exon 2 of AaFLC3 was generated from the first intron by creation of a splicing site and an early stop codon (point mutations are underlined). Nucleotides corresponding to splicing junction sites are shown. (b) Comparison of deduced AaFLC3 protein with other FLC homologues. AaFLC3 contains a MADS-box domain but lacks a K-box region. (c) Reverse transcriptase polymerase chain reaction (RT-PCR) and cleaved amplified polymorphic sequence (CAPS) analysis of AtFLC, AaFLC1 and AaFLC2 in the leaves of Arabidopsis thaliana (Ler), A. arenosa, and Allo733 before bolting. Left, RT-PCR products were digested with BbsI such that AaFLC1/AtFLC and AaFLC2 were distinguishable. Right, RT-PCR products were digested with ClaI such that AtFLC and AaFLC1/AaFLC2 transcripts were distinguishable. (d) The RT-PCR analysis of AaFLC3 expression in leaves in A. arenosa and allopolyploid Allo733. ACTIN2 gene was used as a control.

AaFLC3 is a novel MADS-box gene

The entire sequences of exon 1 and intron 1 were compared among AaFLC3 and other FLC homologues. Sequence comparison confirmed that only AaFLC3 carried a functional splicing site (AG acceptor) in position 423–424 of intron 1, which created exon 2 of AaFLC3 (Fig. 5a). Other FLC homologues had AA or AT in this position, indicating that an A→G point mutation in AaFLC3 created a functional splicing site, leading to exonization of an intronic sequence. The new splicing site was accompanied by a new stop codon (TAA), producing a truncated FLC gene. A full-length FLC is a type II MADS-box gene that contains functional domains, including the MADS box for protein–DNA interaction, the Keratin-like box (K-box) for protein–protein interaction, and the C-terminal region for tertiary complex formation (Kaufmann et al., 2005). AaFLC3 had a predicted MADS box domain and a short C-terminal-like sequence (Fig. 5b), which resembled a type I MADS box protein (Kaufmann et al., 2005). However, a blastp search against the A. thaliana protein database showed that AaFLC3 matched type II MADS-box proteins in the same clade with FLC proteins. RT-PCR analysis in the leaves of A. thaliana (Col and Ler), A. arenosa and Allo733 indicated that AaFLC3 was expressed in A. arenosa and the allotetraploid (Allo733) but not in A. thaliana Col or Ler ecotypes (Fig. 5d).
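
The two sequence checks described in this section, an AG dinucleotide at a given intron position and the first in-frame stop codon of the resulting reading frame, can be sketched in Python as follows. The function names and toy sequences are hypothetical and invented for illustration; they are not the FLC sequences, and the sketch does not model the other signals that a functional splice site requires.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_ag_acceptor(intron: str, pos: int) -> bool:
    """True if the intron carries AG at 1-based positions pos and pos + 1."""
    return intron[pos - 1:pos + 1].upper() == "AG"


def first_stop_codon(cds: str):
    """Return (0-based index, codon) of the first in-frame stop codon, or None.

    Reading starts at the first base of cds and proceeds codon by codon.
    """
    cds = cds.upper()
    for i in range(0, len(cds) - 2, 3):
        codon = cds[i:i + 3]
        if codon in STOP_CODONS:
            return i, codon
    return None


# Toy introns that differ by a single A-to-G change at position 4.
toy_intron_mutant = "CCAGTT"
toy_intron_ancestral = "CCAATT"
# Toy coding sequence with a TAA stop as its third codon.
toy_cds = "ATGGCTTAAGGG"
```

Applied to aligned intron 1 sequences, a check of this kind would flag the homologue carrying AG at positions 423–424, while homologues with AA or AT at the same positions would not be flagged.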

A model for tandem duplication and generation of new gene in the vicinity of FLC

The data suggest that the most recent common ancestor of the two species underwent tandem duplication of three adjacent genes (including FLC), followed by pseudogenization of the duplicated genes flanking FLC (Fig. 6). ORF25, ORF26 (FLC) and ORF27 were tandemly duplicated in the common ancestor. The pseudogenization occurred at a different rate on different genes between the two closely related species A. arenosa and A. lyrata. Indeed, gene fragments of ORF25 were identified in the location between AaFLC1 and AaFLC2 duplicates in A. arenosa (Fig. 2a), but the remaining fragments of ORF27 were absent (data not shown). After the split between A. arenosa and A. lyrata, ORF25 and ORF27 duplicates were pseudogenized. The ORF27 duplicate was pseudogenized faster than the ORF25 duplicate, such that these gene fragments were lost in A. arenosa. Moreover, another duplication of ORF26 might have occurred in A. arenosa, leading to the generation of a new MADS-box gene, ORF26-3, by acquiring part of intron 1 and an early stop codon.

Figure 6.

 Model for FLOWERING LOCUS C (FLC) evolution. Arrowed boxes, three adjacent genes (ORF25, ORF26 and ORF27); hatched boxes, gene loss or fragmentation; ORF26-1 and ORF26-2 are duplicates of FLC in Arabidopsis arenosa and A. lyrata; tinted box, ORF26-3, a partial duplicate of FLC in A. arenosa.


Tandem duplication of FLC locus is subjected to gene loss and purifying selection in A. thaliana-related species A. arenosa and A. lyrata

Gene duplication has been known as a major mechanism for the generation of new gene functions and the origin of adaptive evolutionary novelties (Ohno, 1970; True & Carroll, 2002; Irish & Litt, 2005). Gene duplication is achieved via whole-genome duplication (WGD), local tandem duplication, and/or retrotransposition (Wolfe, 2001; Nei & Rooney, 2005; Casneuf et al., 2006; Xiao et al., 2008). Tandemly duplicated genes are considered to be relatively younger than the genes duplicated by WGD (Rizzon et al., 2006; Kliebenstein, 2008). In plants, c. 16% of genes in A. thaliana and 14% of genes in Oryza sativa were tandemly duplicated (Rizzon et al., 2006). Tandem gene duplication occurs frequently in disease gene clusters and is mainly caused by unequal crossing over between homologous chromosomes or through unequal sister chromatid exchanges (Tilley & Birshtein, 1985; Meyers et al., 1998; Leister, 2004; Sanchez-Gracia et al., 2009). Tandem gene duplicates may be lost over time or develop new functions (Lynch & Conery, 2003) through the processes known as neofunctionalization (Force et al., 1999) and subfunctionalization (Lynch & Force, 2000). This may occur through modifications of protein-coding sequences, as well as noncoding regulatory sequences that are associated with expression diversity. Some novel adaptive traits, including heavy metal resistance in A. halleri (Hanikenne et al., 2008), boron tolerance in barley (Sutton et al., 2007) and submergence tolerance in rice (Xu et al., 2006), have been associated with tandem gene duplicates.

Our previous study indicated that two FLC alleles are present in A. arenosa and contribute to flowering time variation in A. arenosa and resynthesized allotetraploids (Wang et al., 2006a). Here we showed the structure and organization of FLC loci in A. arenosa and A. lyrata. Phylogenetic analysis among FLC homologs suggests that tandem duplication of FLC occurred before the divergence between A. arenosa and A. lyrata (Fig. 3b,c). Interestingly, the tandem duplication event had different consequences in the two species. Only the FLC duplication is retained in A. arenosa, whereas the segmental duplication involving three genes (FLC in the middle) remains traceable in A. lyrata. Ka : Ks values of FLC neighboring genes (ORF25 and ORF27) in A. arenosa and A. lyrata relative to A. thaliana are 0.29 : 0.33 for ORF25 and 0.09 : 0.21 for ORF27, indicating purifying selection on these neighboring genes. In A. arenosa, regional homologous recombination between the tandem arrays may lead to deletion of two neighboring genes in the vicinity of FLC. Alternatively, A. arenosa is a naturally outcrossing autotetraploid, which may complement the loss of genes in this genomic region.

The origin and diversification of FLC duplicates are similar to those of the MADS AFFECTING FLOWERING (MAF) genes, which form a monophyletic clade that is highly related to FLC in the MADS box gene family (Alvarez-Buylla et al., 2000). Arabidopsis thaliana has five MAF genes, MAF1–5 (Ratcliffe et al., 2001). MAF1 is a floral repressor, but unlike FLC, MAF1 expression is not affected by vernalization (Ratcliffe et al., 2003). Other members, MAF2–5, are tandemly arrayed within c. 24 kb on chromosome 5. Although MAF2–3 underwent positive selection, MAF4 and 5 were subjected to purifying selection (Caicedo et al., 2009), indicating functional diversification. Moreover, MAF3–4 expression is downregulated, whereas MAF5 is upregulated under a short period of cold treatment, indicating expression divergence of duplicate genes in response to vernalization (Ratcliffe et al., 2003). Therefore, tandem duplicate genes may lead to functional and expression diversity and adaptive variation.

Gene fragmentation has also been reported in Brassica oleracea genome, which diverged from A. thaliana c. 20 Mya (Town et al., 2006). Out of 177 annotated genes from a triplicated region spanning c. 2.2 Mb in the B. oleracea genome, fragments of 10 colinear genes are identified. Together, the results indicate that sequence diversification and loss are associated with the evolution of tandemly duplicated genes in many plants including Brassicaceae.

Unlike neighboring genes, FLC duplicates are very well preserved. The ratio of Ka and Ks ranges from 0.23 to 0.44, indicating purifying selection against mutations or loss of FLC during evolution. Indeed, null FLC alleles have not been found in A. thaliana ecotypes or related species (Michaels & Amasino, 2001), suggesting that FLC is an essential gene for maintaining flowering-time variation. How and why the selection pressures act differently on FLC and its neighboring genes resulting from the same duplication event remains an interesting question (Rizzon et al., 2006). The tandemly duplicated genes appear to undergo different selection pressures from the WGD duplicated genes because tandem duplication may result in the duplication of one component in the regulatory network, leading to imbalance (Gaut & Ross-Ibarra, 2008). The majority of tandemly duplicated genes were found to encode membrane proteins and the proteins for abiotic and biotic stresses (Rizzon et al., 2006). Membrane-coding, stress-responsive, and flowering-time genes tend to be near the end of biological pathways and are preserved after tandem duplication (Gaut & Ross-Ibarra, 2008).

The coding sequences of FLC homologues are highly conserved but their putative regulatory elements are highly diverged

Coding sequences of FLC homologues are well conserved, suggesting functional constraints on FLC, which inhibits early flowering. Unlike the coding region, except for a small block (c. 340 bp) near the core promoter, the upstream regulatory regions (1.5–2.0 kb) among all FLC homologues are highly divergent (Fig. 4 and data not shown). The minimally functional promoter includes the 5′ UTR (109 bp for AtFLC) and is highly conserved among all FLC homologues (Sheldon et al., 2002; Wang et al., 2006a). Interestingly, a regulatory block in the 5′ regions is found in AaFLC1 but not in AaFLC2 (Fig. 4), which may contribute to the upregulation of AaFLC1 and repression of AaFLC2 in the resynthesized and natural allotetraploids (Wang et al., 2006a). This regulatory block with similar sequences is also located in the first intron of all FLC homologues except for AaFLC3. The regulatory block is shown to be responsive to FLC expression and vernalization effects in A. thaliana (Sheldon et al., 2002). The presence of an extra regulatory copy in the promoter regions of AaFLC1 may be related to the tandem duplication between AaFLC1 and AaFLC3, which involves the first exon and the first intron (Fig. 2b). This 5′ upstream regulatory block of AaFLC1 may serve as an extra enhancer element for upregulation of AaFLC1 in the allotetraploids as well as in response to vernalization.

Collectively, the sequence and functional analyses of FLC in A. thaliana-related species (Wang et al., 2006a) suggest that the FLC1 and FLC2 duplicates in A. arenosa and A. lyrata share the same function as FLC in A. thaliana. The presence of different 369-bp regulatory regions in the FLC homologues of A. thaliana and its closely related species indicates that FLC expression is likely to be regulated differently in these species. Tandemly duplicated genes tend to diverge rapidly in expression (Casneuf et al., 2006), and duplicated genes are easily released from the original selection pressures, providing adaptive and developmental novelty. The different patterns in the regulatory sequences among FLC duplicates may reflect expression diversity. If the FLC expression pattern is conserved among duplicates, dosage effects would predominate. Whether FLC duplicates exert dosage effects, developmental regulation and/or rapid responses to the environment needs to be determined empirically.

The novel MADS gene AaFLC3 is generated through exonization of intronic sequence

AaFLC3 is generated through exonization of intronic sequences. A substitution mutation (A→G) in AaFLC3 created a splicing acceptor site in the exon 2, followed by a new stop codon (TAA) (Fig. 5a). Exonization of intronic sequences, particularly those originating from repetitive sequences such as Alu or retrotransposable elements, has been found in humans (Makalowski et al., 1994), rodents (Wang et al., 2005), dogs (Wang & Kirkness, 2005) and many other vertebrates (Alekseyenko et al., 2007). The molecular mechanisms associated with exonization of intronic sequences have been well documented for the evolution of Alu elements inserted into the intronic region between two exons (Lev-Maor et al., 2003; Sorek et al., 2004). The overall number of mutations in these Alu elements is remarkably small, and the mutations are mostly concentrated in the 5′ and 3′ splice-site regions. Almost all exonized Alu elements in the human genome are alternatively spliced, and only a small fraction of transcripts contains the new exons (Sorek et al., 2002). There is no evidence for the presence of transposable elements in AaFLC3. AaFLC3 was highly expressed in leaves of A. arenosa and of the allotetraploid Allo733, which was derived from hybridization between A. thaliana (Ler) and A. arenosa tetraploids (Comai et al., 2000; Wang et al., 2004).
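The logic of exonization by a single substitution can be illustrated with a toy example in Python. The sequence below is hypothetical and is not the AaFLC3 sequence; the helper names are also illustrative. The sketch treats any AG dinucleotide as a candidate acceptor site, which is a deliberate simplification, because real acceptor recognition also depends on the polypyrimidine tract and the branch point. It shows how an A→G change can create an AG acceptor, after which the newly exonized sequence is read in frame until a TAA stop codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def point_mutate(seq, pos, new_base):
    """Return seq with the base at 0-based position pos replaced."""
    return seq[:pos] + new_base + seq[pos + 1:]


def find_acceptor_sites(seq):
    """Return the 0-based index of the first base after each AG dinucleotide."""
    return [i + 2 for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]


def read_until_stop(seq, start):
    """Read codons from start and stop after the first stop codon, if any."""
    codons = []
    for i in range(start, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        codons.append(codon)
        if codon in STOP_CODONS:
            break
    return codons


# Hypothetical intronic sequence without any AG dinucleotide.
original = "TTTCCTTTCAATTCCTGTAACC"
# A->G substitution at position 10 turns "AA" into the acceptor "AG".
mutant = point_mutate(original, 10, "G")

print(find_acceptor_sites(original))
print(find_acceptor_sites(mutant))
print(read_until_stop(mutant, find_acceptor_sites(mutant)[0]))
```

In this toy case the new exon is only three codons long because the reading frame reaches TAA almost immediately, mirroring the early stop codon that truncates AaFLC3.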

AaFLC3 is predicted to encode a 90-amino-acid MADS-box protein (Fig. 5b). The truncated AaFLC3 is a novel MADS-box gene without a K-box domain. The K-box is known to be responsible for protein–protein interactions during the formation of MADS-box transcription factor complexes (Kaufmann et al., 2005). The absence of the K-box may therefore affect protein complex formation, compared with the original FLC protein. FLC is a type II MADS-box gene, which contains all four functional domains: the MADS box, the intervening domain, the K-box and the C-terminus (Kaufmann et al., 2005). AaFLC3 resembles type I MADS-box genes, which contain only a MADS box and a C-terminal region. However, AaFLC3 contains one intron, unlike typical type I MADS-box genes, which do not have introns (De Bodt et al., 2003). It is likely that AaFLC3 is a type II MADS-box gene by origin but may function as a type I MADS-box protein. Expression of AaFLC3 in A. arenosa and the allotetraploids suggests that AaFLC3 is functional. Whether AaFLC3 affects flowering time or plays a different role remains to be investigated. We suggest that the AaFLC3 MADS-box gene encodes a transcription factor for growth and development. The generation and maintenance of AaFLC3 after segmental duplication in A. arenosa may also suggest that AaFLC3 affects growth and developmental patterns that are specific to A. arenosa.


We thank Lu Tian for helping to select the BAC clone containing the FLC locus, and Misook Ha and other members of the Chen laboratory for helpful discussions that improved the manuscript. This work was supported by a grant from the National Science Foundation Plant Genome Research Program (DBI0733857) (Z.J.C.).