Characterization of duplicate gene evolution in the recent natural allopolyploid Tragopogon miscellus by next-generation sequencing and Sequenom iPLEX MassARRAY genotyping

Authors


  • R.J.A.B. uses molecular genetic and bioinformatics approaches to study duplicate gene evolution and the consequences of polyploidy. C.S. uses computational methods and next-generation sequencing technologies to investigate genome architecture, gene expression and alternative splicing. W.W. is the manager of the Schnable lab, where she conducts research on heterosis and carbon-capturing crops. L.G. manages the Genomics Technologies Facility at Iowa State University. G.D.M.’s research focuses on comparative genomics for crop improvement applications. P.S. Schnable’s research focuses on structural and functional analyses of complex plant genomes, primarily maize; he was a co-PI on the maize genome sequencing project. P.S. Soltis’ research interests include plant phylogenetics, polyploidy, gene family evolution, phylogeography and conservation genetics. D.E.S. is interested in angiosperm phylogeny, genome doubling, floral developmental genetics, phylogeography and molecular cytogenetics. W.B.B. uses bioinformatics and comparative and functional genomics to investigate plant genome structure and function, gene expression and alternative splicing.

Dr W. Brad Barbazuk, Fax: 352-273-8284; E-mail: bbarbazuk@ufl.edu

Abstract

Tragopogon miscellus (Asteraceae) is an evolutionary model for the study of natural allopolyploidy, but until now has been under-resourced as a genetic model. Using 454 and Illumina expressed sequence tag sequencing of the parental diploid species of T. miscellus, we identified 7782 single nucleotide polymorphisms that differ between the two progenitor genomes present in this allotetraploid. Validation of a sample of 98 of these SNPs in genomic DNA using Sequenom MassARRAY iPlex genotyping confirmed 92 SNP markers at the genomic level that were diagnostic for the two parental genomes. In a transcriptome profile of 2989 SNPs in a single T. miscellus leaf, using Illumina sequencing, 69% of SNPs showed approximately equal expression of both homeologs (duplicate homologous genes derived from different parents), 22% showed apparent differential expression and 8.5% showed apparent silencing of one homeolog in T. miscellus. The majority of cases of homeolog silencing involved the T. dubius SNP homeolog (164/254; 65%) rather than the T. pratensis homeolog (90/254). Sequenom analysis of genomic DNA showed that in a sample of 27 of the homeologs showing apparent silencing, 23 (85%) were because of genomic homeolog loss. These methods could be applied to any organism, allowing efficient and cost-effective generation of genetic markers.

Introduction

Many natural and domesticated plant species are hybrids which have undergone whole-genome duplication. This condition, known as allopolyploidy (Kihara & Ono 1927), may have large effects on both the ecology (e.g. Stebbins 1942; Buggs & Pannell 2007) and evolution (Soltis & Soltis 1999; Adams & Wendel 2005) of a lineage. Genome evolution of allopolyploids has been extensively studied in crop species such as cotton (Adams & Wendel 2004; Udall & Wendel 2006), wheat (Feldman et al. 1997; Levy & Feldman 2004; Dong et al. 2005; Bottley et al. 2006), soybean (Joly et al. 2004) and tobacco (Lim et al. 2004; Petit et al. 2007), as well as genetic models such as Arabidopsis (Chen et al. 2004, 2008). These studies demonstrate dynamic patterns of evolution, but have limitations as a result of uncertainties about the precise history and ecological context of the lineages. Furthermore, they cannot provide insights into the early stages of polyploid evolution in nature. It is therefore difficult to know whether certain evolutionary changes took place in the progenitor diploids, upon allopolyploidization or in the subsequent generations.

A need therefore exists for natural allopolyploid model organisms with a known history and ecological context (Soltis et al. 2004b; Buggs 2008). A handful of species have been identified for this purpose, such as Senecio cambrensis (Hegarty et al. 2005), Spartina anglica (Ainouche et al. 2004), Tragopogon mirus and T. miscellus (Soltis et al. 2004a). Tragopogon miscellus is a particularly tractable evolutionary model for the study of the early generations of allopolyploidy. Its origin can be accurately dated to about 80 years ago (Ownbey 1950; Soltis et al. 2004a). The parental diploid species are known and still coexist with their allopolyploid derivative; both reciprocal crosses of the parents exist in natural populations and at least one of them appears to have originated multiple times (Novak et al. 1991; Soltis et al. 1995; Symonds et al. 2009). Tragopogon miscellus is a textbook example of allopolyploid speciation (e.g. Judd et al. 2007; Sadava et al. 2008).

Unlike the crop species that have been used to study allopolyploid evolution, the natural allopolyploid evolutionary model systems are under-resourced as genetic models. To date, the best resourced is S. cambrensis for which cDNA microarrays have been made to study gene expression (Hegarty et al. 2005, 2006). Until now resources for T. miscellus have consisted of DNA sequence tags for only 23 duplicate gene pairs (Tate et al. 2006, 2009a; Buggs et al. 2009), a handful of phylogenetic markers (Mavrodiev et al. 2005) and 2000 uncharacterized Sanger ESTs (J. Koh, J. Tate, D. Soltis and P. Soltis, unpublished data). This paucity of sequence data contrasts with the usefulness of T. miscellus as an evolutionary model.

One key issue in the evolution of allopolyploids is the fate of duplicated genes. Duplicate gene evolution is important for understanding the evolution of the allopolyploids themselves, and may allow for more general statements about the evolution of duplicated genes in nonpolyploid organisms. Natural allopolyploid models present systems containing a whole genome’s worth of duplicated genes of identical and known age. Duplicated genes may have a variety of evolutionary fates: nonfunctionalization, subfunctionalization and neo-functionalization (Lynch & Conery 2000). Several studies have examined the evolution of homeologs (genes duplicated by whole-genome duplication) in allopolyploids. Studies in crop species have shown homeolog loss (e.g. Song et al. 1995; Kashkush et al. 2002) and patterns of homeolog expression suggestive of subfunctionalization (e.g. Adams et al. 2003; Flagel et al. 2008).

In natural models, our knowledge of homeolog evolution is limited. In the S. cambrensis cDNA microarray, the oligonucleotides used did not distinguish between homeologs: measures of gene expression were the total expression of both homeologs. In T. miscellus, loss and silencing of homeologs have been shown to occur in the early generations of allopolyploidy (Tate et al. 2006, 2009a; Buggs et al. 2009), but this is based on analysis of only 20 homeolog pairs using PCR-based methods. New surveys are needed that will move us from a gene-by-gene approach to a genomic level. This requires a dramatic increase in the genomic resources available for plants that are good evolutionary models but not genetic models. We wished to develop a protocol that would produce a large number of homeolog-specific markers in T. miscellus at minimal time and expense, allowing us to assess homeologous gene loss and silencing.

Sequencing of cDNA or expressed sequence tags (EST) provides a rapid method for gene discovery and can be used to identify transcripts associated with specific biological processes. As such, it is often a first step in the genomic characterization of an organism. Variation in ESTs can be characterized by single nucleotide polymorphisms (SNPs), which are single-base differences between haplotypes. Transcript-associated SNPs can be used to develop allele-specific assays for the examination of cis-regulatory variation within a species (Guo et al. 2004; Stupar & Springer 2006) and may provide a rapid means to investigate differential expression and gene gain/loss within polyploids. EST collections and SNP discovery rely on DNA sequencing, which until recently was prohibitively costly for most evolutionary studies.

Recent advances in high-throughput sequencing technology provide rapid and cost-effective means to generate sequence data (Stupar & Springer 2006; Ellegren 2008; Hudson 2008). This new paradigm, termed flow-cell sequencing (reviewed in Holt & Jones 2008), consists of stepwise determination of DNA sequence by iterative cycles of nucleotide extensions done in parallel on huge numbers of clonally amplified template molecules. This massively parallel approach enables DNA sequence to be acquired at extremely high depths of coverage in less time and for less cost than traditional sequencing. The 454-FLX produces 200 000 sequences per run with ∼200–300 bp lengths (100 Mb). With new Titanium reagents, this can be increased to over 1 million sequences with ∼350–400 bp read lengths (400–600 Mb per run). In contrast, the Illumina Genome Analyzer (GA) II DNA sequencing instrument can produce >80 million sequences, each of which is 36 bp in length (>2 Gb). Short read lengths can confound assembly and alignment programs, but the reduction in read length vs. increased depth of coverage is an acceptable trade-off for many resequencing applications such as transcript expression profiling (Eveland et al. 2008), in vivo DNA binding site detection (Johnson et al. 2007) and polymorphism detection (Barbazuk et al. 2007; Novaes et al. 2008; Van Tassell et al. 2008). In the latter application, a high volume of short reads is very powerful in discriminating sequence variants, enabling reliable SNP discovery, so long as each read is long enough and accurate enough to align uniquely to the reference sequences.

To permit gene discovery and genomic tool development in species with few genomic resources, we designed a hybrid sequencing approach. In this approach, the Roche 454 sequencer is first used to generate transcriptome or genomic sequences that can be assembled and used as reference sequences (as in, e.g. Novaes et al. 2008). We then use this reference for subsequent alignment of Illumina short reads. This method gains maximum leverage from the longer read lengths of 454 sequencing and the deeper coverage of Illumina. Assembling 454 sequence reads is less problematic than Illumina reads, making it the high-throughput sequencing method of choice for species with few genomic resources and it is particularly useful in transcriptome characterization (Cheung et al. 2006, 2008; Emrich et al. 2007; Novaes et al. 2008). The 454 assemblies can therefore be used for gene annotation and the Illumina sequences used to identify SNPs and examine relative expression differences.

Once SNPs have been identified, a highly efficient way to validate them and carry out large-scale surveys of their frequencies is the Sequenom MassARRAY iPLEX genotyping platform (Gabriel et al. 2009). In this method, a short section of DNA containing a SNP is amplified from an individual by PCR. This is followed by a high-fidelity single-base primer extension reaction over the SNP being assayed, using nucleotides of modified mass. The different alleles therefore produce oligonucleotides with mass differences that can be detected using highly accurate Matrix-Assisted Laser Desorption/Ionization Time-Of-Flight mass spectrometry. Up to 40 different SNPs can be multiplexed in one assay if primers are designed by custom software to give unique mass ranges for each SNP. This method is especially suited for detecting homeologs which differ in only a few SNPs as, unlike microarrays which rely on hybridization of oligonucleotides, it detects differences by single-nucleotide extension over SNPs.

In this study, we demonstrate the utility of hybrid next-generation sequencing and Sequenom genotyping for the study of homeolog evolution in T. miscellus. We report the transcriptome characterization of T. dubius, one of the diploid progenitors of T. miscellus, with 454 sequencing and the subsequent discovery of over 24 000 SNPs between T. dubius and the other parental diploid species, T. pratensis, using Illumina sequencing. We validated a subset of 98 SNPs that represent homeolog pairs in T. miscellus at the genomic level using Sequenom MassARRAY iPLEX genotyping. In addition, expression profiling of a T. miscellus individual using Illumina sequencing was performed. We assessed the utility of this profile for the selection of candidate genes for the investigation of loss from the genome. These methods could be applied to any organism, allowing efficient and cost-effective generation of genetic markers.

Materials and methods

Seeds were collected from natural populations of allotetraploid Tragopogon miscellus (Soltis and Soltis collection no. 2671) and its diploid parent species, T. dubius (collection no. 2674) and T. pratensis (collection no. 2672), in Oakesdale, WA. The three species grow in sympatry in this location, and this fact, together with microsatellite data (Symonds et al. unpubl. data), suggest that the diploid populations were the source of the progenitors of the allotetraploid population. These seeds were germinated and grown in an air-conditioned greenhouse with supplementary lighting at the University of Florida (Gainesville, FL, USA). Tragopogon miscellus from Oakesdale is the short-liguled form, with T. pratensis as the maternal parent (Soltis & Soltis 1989; Soltis et al. 1995).

RNA was extracted from leaf tissue of three individuals from Oakesdale: T. dubius 2674-4 (ID no. 3911), T. pratensis 2672-5 (ID no. 3913) and T. miscellus 2671-1 (ID no. 3912). Basal leaf tissue from each plant was flash frozen and ground in liquid nitrogen using a pestle and mortar. RNA extractions were performed following a portion of the CTAB DNA extraction protocol (Doyle & Doyle 1987) and subsequent use of the RNeasy Plant Mini Kit (Qiagen) with on-column DNase digestion. This method was originally developed for the successful extraction of RNA from Amborella and Nuphar (Kim et al. 2004) and copes well with the latex produced by Tragopogon photosynthetic tissue. This was followed by an RNA cleanup using the protocol of the RNeasy Plant Mini Kit. These extractions were quality-checked using the Agilent 2100 Bioanalyzer (Agilent Technologies).

454 EST sequencing and processing

Using the T. dubius RNA, a normalized cDNA library was produced via the following method. The Evrogen MINT cDNA synthesis kit (Evrogen) was used to produce double-stranded cDNA following the manufacturer’s protocol. This cDNA was cleaned using the Wizard® SV Gel and PCR Clean-Up System (Promega). The Evrogen TRIMMER cDNA normalization kit (Evrogen) was used to normalize and amplify the cDNA library, following the manufacturer’s instructions. In the normalization step, a 0.5 dilution of the duplex-specific nuclease was found to be optimal. In the amplification step, 12 cycles were found to be optimal. The resulting normalized library was used for 454 sequencing.

454 sequencing was performed as described in the supplementary material and methods to Margulies et al. (2005) with slight modifications as specified by 454 Life Sciences. Briefly, cDNA was sheared by nebulization to a size range of 300–800 bp. DNA fragment ends were repaired and phosphorylated using T4 DNA polymerase and T4 polynucleotide kinase. Adaptor oligonucleotides ‘A’ and ‘B’ supplied with the 454 Life Sciences sequencing reagent kit were ligated to the DNA fragments using T4 DNA ligase. Purified DNA fragments were hybridized to DNA capture beads and clonally amplified by emulsion PCR. DNA capture beads containing amplified DNA were deposited on a 70 × 75 mm PicoTiter plate and DNA sequences determined using the GS-FLX instrument. This resulted in 822 594 EST sequences. The T. dubius 454 EST sequences were assembled with the Newbler assembler, a part of the software package distributed with 454 sequencing machines. Newbler is an assembler that takes into account the specifics of pyrosequencing errors to generate accurate contigs (Chaisson & Pevzner 2008). Our assembly used the default directives and a vector trimming database including the Evrogen primer and 454 adapter sequences.

Comparisons of 454 ESTs to public sequence databases (annotation)

Annotated EST contig assemblies and singletons were obtained from the curated Gene Indices Project (Quackenbush et al. 2000; http://compbio.dfci.harvard.edu/tgi/) for three other species in the Asteraceae: Lactuca sativa (ver. 3.0), Lactuca serriola (ver. 1.0) and Helianthus annuus (ver. 5.0). These sequences were pooled, formatted into a BLAST-searchable database and aligned to the T. dubius 454 EST assemblies with WU-TBLASTX (version 2.0), which translates both the query and subject sequences in all six potential reading frames prior to alignment, to identify the top hit for each T. dubius contig at two thresholds (P-value ≤ 1e−05 and ≤ 1e−10). The T. dubius 454 EST contigs were also BLASTX-aligned to Arabidopsis CDS sequences (TAIR version 8) because Arabidopsis represents the best curated plant genome available. Top hits for each T. dubius contig to the Arabidopsis protein set were identified at the same two thresholds (P-value ≤ 1e−05 and ≤ 1e−10). Similarity search results are summarized in Table 1.

Table 1.   Results of Tragopogon dubius similarity searches (blastx)

| Sequence collection used for similarity searches | No. 454 contigs with similarity at 1e−05 | No. annotation sequences hit at 1e−05 | No. 454 contigs with similarity at 1e−10 | No. annotation sequences hit at 1e−10 |
| --- | --- | --- | --- | --- |
| Lactuca sativa, Lactuca serriola and Helianthus annuus Gene Index | 21 498 (Lactuca sativa: 11 080; Lactuca serriola: 6078; Helianthus annuus: 4340) | 16 611 | 18 526 (Lactuca sativa: 9731; Lactuca serriola: 5264; Helianthus annuus: 3531) | 14 914 |
| Arabidopsis annotated peptides | 18 923 | 11 086 | 16 412 | 10 180 |
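The top-hit selection used in these similarity searches can be sketched as follows. This is a minimal illustration and not the pipeline actually used: it assumes the hits have already been parsed into (query, subject, P-value) tuples, and the function name `top_hits` and all sequence identifiers are hypothetical.

```python
from typing import Dict, Iterable, Tuple

Hit = Tuple[str, str, float]  # (query contig, subject sequence, P-value)


def top_hits(hits: Iterable[Hit], max_p: float = 1e-5) -> Dict[str, Tuple[str, float]]:
    """Keep, for each query contig, the hit with the lowest P-value at or below max_p."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, p in hits:
        if p > max_p:
            continue  # hit fails the significance threshold
        if query not in best or p < best[query][1]:
            best[query] = (subject, p)
    return best


# Hypothetical identifiers and P-values, for illustration only.
example_hits = [
    ("contig00001", "Ls_TC1", 1e-30),
    ("contig00001", "Ha_TC9", 1e-12),
    ("contig00002", "Ls_TC7", 1e-7),
    ("contig00003", "Ha_TC2", 1e-3),
]
at_1e5 = top_hits(example_hits, max_p=1e-5)
at_1e10 = top_hits(example_hits, max_p=1e-10)
# Number of distinct annotation sequences hit at the looser threshold.
subjects_hit = {subject for subject, _ in at_1e5.values()}
```

Counting the keys of the returned dictionary gives the number of contigs with similarity at a threshold, and counting the distinct subjects gives the number of annotation sequences hit, which are the two quantities reported per threshold in Table 1.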

Illumina sequencing

The RNA extractions from T. dubius 2674-4 (ID no. 3911), T. pratensis 2672-5 (ID no. 3913) and T. miscellus 2671-1 (ID no. 3912) were used for Illumina sequencing. Poly A+ RNA was isolated from total RNA through two rounds of oligo-dT selection (Dynabeads mRNA Purification Kit, Invitrogen Inc.). The mRNA was annealed to high concentrations of random hexamers and reverse transcribed. Following second strand synthesis, end repair and A-tailing, adapters complementary to sequencing primers were ligated to cDNA fragments (mRNA-Seq Sample Prep Kit, Illumina). Resultant cDNA libraries were size fractionated on agarose gels and 250 bp fragments were excised and amplified by 15 cycles of polymerase chain reaction. Resultant libraries were quality assessed using a Bioanalyzer 2100 and sequenced for 36 cycles on an Illumina GA II DNA sequencing instrument using standard procedures.

SNP discovery

All Illumina reads from the T. dubius and T. pratensis parents and the T. miscellus allotetraploid were labelled with species identifiers, pooled and aligned to the T. dubius 454 FLX contigs with the MosaikAligner package (Hillier et al. 2008) using the following MosaikAligner parameters: -a (alignment algorithm) all; -p (CPUs used) 8; -mm (maximum mismatch) two in a preliminary analysis and one in a final analysis; -m (alignment mode) unique; -hs (hash size) 15; -mhp (maximum number of hash positions to use) 100. These alignment parameters ensured that each Illumina sequence aligned to a unique position within the 454 T. dubius EST assembly reference sequences and with no more than one base-pair mismatch in the final analysis. Illumina reads that did not align with the 454 contigs under these stringent conditions were discarded from the analysis.
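The acceptance criterion described above (a unique placement with no more than one mismatch) can be illustrated with a brute-force sketch. This is not the hashing algorithm used by MosaikAligner; it ignores reverse complements and gaps, and the function names and toy sequences are hypothetical.

```python
from typing import Dict, Optional, Tuple


def mismatches(a: str, b: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def unique_placement(read: str, contigs: Dict[str, str], max_mm: int = 1) -> Optional[Tuple[str, int]]:
    """Return (contig name, offset) if the read fits exactly one place
    with at most max_mm mismatches; otherwise return None (read discarded)."""
    placements = []
    for name, seq in contigs.items():
        for i in range(len(seq) - len(read) + 1):
            if mismatches(read, seq[i:i + len(read)]) <= max_mm:
                placements.append((name, i))
    return placements[0] if len(placements) == 1 else None


# Toy reference sequences, for illustration only.
reference = {"c1": "ACGTACGGTTCA", "c2": "TTTTGGGGCCCC"}
exact = unique_placement("ACGGTT", reference)       # perfect match in c1
one_off = unique_placement("ACGGTA", reference)     # one mismatch, still accepted
two_off = unique_placement("ACGGAA", reference)     # two mismatches, rejected
repeated = unique_placement("ACGGTT", {"c1": "ACGGTTAAACGGTT"})  # two placements, rejected
```

Reads that return None here correspond to the reads discarded from the analysis, either because they fit more than one location or because they exceed the mismatch limit.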

SNPs were identified within the alignments with the GigaBayes package (http://bioinformatics.bc.edu/marthlab/GigaBayes). GigaBayes is a reimplementation of the PolyBayes (Marth et al. 1999) SNP discovery tool that has been optimized for next-generation sequences. Arguments to GigaBayes were: –D (pairwise nucleotide diversity) 0.001; –ploidy (sample ploidy) diploid; –sample (sequence source) multiple; –anchor; –algorithm banded; –CAL (minimum overall allele coverage) 3; –QRL (minimum base quality value) 20. Custom Perl scripts were written to automate the SNP discovery process on all alignments to reference contigs and to parse the GigaBayes output files (GFF), which contain the site identification of each SNP, its representation within each of the three Tragopogon species (T. dubius, T. pratensis and T. miscellus) and its allele usage.
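The idea of a SNP that is diagnostic for the two parental genomes can be sketched as a simple filter on per-species base counts. This is a deliberately simplified illustration and not the Bayesian model implemented in GigaBayes: it assumes base counts per site have already been extracted, treats any parent showing more than one base as uninformative, and uses the hypothetical function name `diagnostic_snp`.

```python
from typing import Dict, Optional, Tuple

Counts = Dict[str, int]  # base -> number of reads supporting it at one site


def diagnostic_snp(td: Counts, tp: Counts, min_cov: int = 3) -> Optional[Tuple[str, str]]:
    """Return (T. dubius base, T. pratensis base) if the site separates the parents.

    A site qualifies only if each parent shows a single base, that base is
    supported by at least min_cov reads, and the two bases differ.
    """
    td_bases = [base for base, n in td.items() if n > 0]
    tp_bases = [base for base, n in tp.items() if n > 0]
    if len(td_bases) != 1 or len(tp_bases) != 1:
        return None  # a parent is polymorphic or has no reads at this site
    td_base, tp_base = td_bases[0], tp_bases[0]
    if td_base == tp_base:
        return None  # no difference between the parents
    if td[td_base] < min_cov or tp[tp_base] < min_cov:
        return None  # insufficient coverage in at least one parent
    return td_base, tp_base


# Toy counts, for illustration only.
threefold = diagnostic_snp({"A": 5}, {"G": 4}, min_cov=3)
tenfold = diagnostic_snp({"A": 5}, {"G": 4}, min_cov=10)
polymorphic = diagnostic_snp({"A": 5, "G": 1}, {"G": 4})
```

Raising `min_cov` from 3 to 10 mirrors the move from a permissive to a high-confidence SNP set: the same site can pass the first filter and fail the second.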

Any site where both the T. pratensis and T. dubius homeologs were evidenced in the T. miscellus data was flagged as a suitable SNP for the study of homeolog loss in T. miscellus. Where both homeologs were present in at least 10 T. miscellus Illumina reads and the observed allelic ratio was more balanced than 70:30 in either direction, we took this as preliminary evidence that both homeologs were equally expressed. In contrast, any site where either the T. pratensis or T. dubius parental homeolog was present at 10× while the other was absent was identified as suggestive of either complete silencing of one parental homeolog or genomic homeolog loss.
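These decision rules can be written as a short classifier. The sketch below rests on assumptions not stated in the text: the 10-read minimum is applied to each homeolog separately, sites with both homeologs at 10 or more reads but a ratio of 70:30 or more skewed are labelled differential, and everything else is labelled low coverage. The function name and category labels are hypothetical.

```python
def classify_homeolog_expression(td_reads: int, tp_reads: int, min_reads: int = 10) -> str:
    """Classify one SNP site from T. miscellus read counts for each parental base."""
    if td_reads >= min_reads and tp_reads == 0:
        return "Tp homeolog absent"  # apparent silencing or genomic loss
    if tp_reads >= min_reads and td_reads == 0:
        return "Td homeolog absent"  # apparent silencing or genomic loss
    if td_reads >= min_reads and tp_reads >= min_reads:
        total = td_reads + tp_reads
        # "More balanced than 70:30": the commoner base is under 70% of reads.
        # Integer arithmetic avoids floating-point edge cases at exactly 70%.
        if 10 * max(td_reads, tp_reads) < 7 * total:
            return "balanced"
        return "differential"
    return "low coverage"


# Toy read counts, for illustration only.
examples = {
    (21, 16): classify_homeolog_expression(21, 16),
    (40, 10): classify_homeolog_expression(40, 10),
    (0, 15): classify_homeolog_expression(0, 15),
    (17, 0): classify_homeolog_expression(17, 0),
    (18, 9): classify_homeolog_expression(18, 9),
}
```

Sites in the two "absent" categories cannot be resolved from transcriptome data alone, since silencing and genomic loss give the same read counts; distinguishing them requires genotyping of genomic DNA.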

SNP validation

A subset of SNPs identified using the above methods was analysed using the Sequenom MassARRAY iPLEX platform at the Center for Plant Genomics, Iowa State University. Genomic DNA was extracted from leaf tissue of the three plants used for the transcriptome sequencing, using a modified CTAB protocol (Doyle & Doyle 1987). Multiplexed assays were designed using the Sequenom Assay Design 3.1 software for four plexes containing a total of 139 SNPs between T. dubius and T. pratensis. Of these, 42 were scored as ‘potential gene loss’ using the Illumina read data, 77 were scored as ‘alleles balanced’, 19 were scored as ‘low coverage in T. miscellus’ and one had no T. miscellus reads. This assay design was used to genotype a 384-well plate that included T. dubius, T. pratensis and T. miscellus genomic DNA samples (∼20 ng/μL). The resulting data were analysed using the MassARRAY Typer 4.0 Analyzer software. Using the manufacturer’s settings, the Sequenom software was used to call SNPs at ‘aggressive’, ‘moderate’ and ‘conservative’ degrees of confidence.

Results

454 Sequencing, assembly and annotation of T. dubius cDNA sequences

454 FLX sequencing of the normalized cDNA pool from T. dubius leaf tissue produced 822 594 reads (237 bp av. length) representing >195 Mb of sequence. These reads have been uploaded to the NCBI Short Read Archive (accession no. SRA009218.13). Assembly of the 454 FLX reads with the Roche 454 Newbler assembler produced 33 515 contigs (14.7 Mb) with an average length of 439 bp (min = 96, max = 3418), an average depth of 17.6 reads and an N50 contig size of 626 bp (see Fig. 1).

Figure 1. Analysis of Newbler assembly of 454 reads showing frequency of contigs in different length and coverage categories.
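For readers unfamiliar with the N50 statistic quoted for the assembly, the sketch below shows how it is conventionally computed: contigs are sorted from longest to shortest, and N50 is the length of the contig at which the running total first reaches half of the assembled bases. The contig lengths used are toy values, not the Tragopogon data.

```python
from typing import Sequence


def n50(lengths: Sequence[int]) -> int:
    """Return the N50 of a set of contig lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # reached at least half of all assembled bases
            return length
    raise ValueError("no contigs supplied")


# Toy contig lengths, for illustration only.
toy_lengths = [2, 3, 4, 5, 6]
toy_n50 = n50(toy_lengths)
toy_mean = sum(toy_lengths) / len(toy_lengths)
```

Because long contigs contribute more bases than short ones, N50 is typically larger than the mean contig length, as is the case for the assembly reported here (626 bp vs. 439 bp).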

In comparison with other species in the Asteraceae, 21 498 (64%) of the T. dubius 454 EST contigs matched previously characterized EST assemblies (tblastx) from Lactuca sativa, L. serriola and Helianthus annuus with P-values of e−5 or better. This low percentage may reflect the low depth and coverage in many of our 454 contigs (Fig. 1) or significant divergence among the species. Of the 21 498 hits, 18 526 (86%) were to unique EST assemblies in this curated database. The 14% of nonunique contigs may be due to paralogous sequences in T. dubius or to nonoverlapping assemblies of T. dubius sequence from the same cDNA template, as the ‘shotgun’ nature of 454 sequencing enables simultaneous sampling of discrete template regions. The majority of best matches occurred between T. dubius and L. sativa (Table 1). In comparison with A. thaliana, 18 923 T. dubius 454 EST contig assemblies matched A. thaliana CDS sequences (tblastx) at P-values of e−5 or better, while a total of 22 946 T. dubius contigs hit at least one sequence in either the A. thaliana or the Asteraceae collection.

SNP discovery

Nonnormalized cDNA pools sequenced on single lanes of an Illumina GAII Analyzer resulted in 7 128 226, 6 840 425 and 6 729 215 reads from T. dubius, T. pratensis and T. miscellus, respectively. These reads have been uploaded to the NCBI Short Read Archive (accession no. SRA009218.13). Alignment of pooled Illumina reads to the T. dubius 454 assembled EST reference sequences with a mismatch tolerance of 2 bp, followed by identification of polymorphic sites that were represented to a minimum of threefold redundancy in both T. dubius and T. pratensis, revealed >45 000 potential SNPs within 10 428 contigs. To reduce the risk of misaligning repetitive or highly paralogous sequences, parameters were adjusted to permit only a single mismatch over the length of the Illumina reads. Of the total pooled T. dubius, T. pratensis and T. miscellus Illumina reads, 11 050 022 (53.4%) aligned. The remaining reads were unaligned because they did not map to a unique location in the 454 contig reference sequence collection or they did not meet the single mismatch criterion. This higher confidence alignment, when parsed for polymorphic sites that were represented to a minimum of threefold redundancy in both T. dubius and T. pratensis, resulted in the identification of 24 078 potential SNPs between T. dubius and T. pratensis within 7837 unique 454 EST contig reference sequences. To identify an even higher-quality collection of potential SNP sites between T. dubius and T. pratensis, the aforementioned alignments were parsed for SNP sites that were represented to a minimum depth of 10× in both the T. dubius and T. pratensis data sets. This high-quality collection, which maximizes the likelihood that discovered polymorphic sites represent true SNPs between T. dubius and T. pratensis, consists of 7782 SNPs within 2885 unique contigs.
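The read totals and the proportion aligned can be checked directly from the figures quoted above. The short calculation below uses only numbers given in the text; the variable names are arbitrary.

```python
# Illumina reads per species, as reported in the text.
reads = {
    "T. dubius": 7_128_226,
    "T. pratensis": 6_840_425,
    "T. miscellus": 6_729_215,
}
total_reads = sum(reads.values())

# Reads aligned uniquely with at most one mismatch.
aligned = 11_050_022
aligned_pct = 100 * aligned / total_reads

# Mean number of SNPs per SNP-containing contig at each stringency.
snps_per_contig_3x = 24_078 / 7837
snps_per_contig_10x = 7782 / 2885

print(f"{total_reads} pooled reads, {aligned_pct:.1f}% aligned")
```

The pooled total is 20 697 866 reads, of which the 11 050 022 aligned reads make up 53.4%, in agreement with the reported value.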

Of the 7782 SNPs, 2989 had sufficient T. miscellus Illumina reads for transcriptome analysis. Of these, 2064 (69%) appeared to show equal homeolog expression in T. miscellus, 671 (22%) showed differential expression in T. miscellus and 254 (8.5%) showed potential homeolog loss in T. miscellus. Interestingly, the cases of differential expression were mainly because of higher expression of the T. dubius homeolog than of the T. pratensis homeolog (454/671; 77%) and most of the apparent losses were also of the T. dubius homeolog (164/254; 65%) rather than the T. pratensis homeolog (90/254).

SNP validation

Sequenom MassARRAY iPLEX assays were designed for 139 of the putative SNPs (four plexes). These assays were used to analyse the genomic DNA of the two diploid plants whose transcriptomes were used for 454 and Illumina sequencing. For 19 of the assays, the Sequenom assay failed to call a SNP in both diploid species, and 22 assays only worked in one of the diploid species. This failure rate is comparable to those obtained by other groups (Dunstan et al. 2007). Of the 98 informative assays (Table 2), 92 (94%) confirmed the SNP calls. In five of the remaining assays, the correct polymorphism was present but there was an extra allele in the genome of one diploid (i.e. heterozygosity) that had not been detected by transcriptome sequencing. In only one case did the base call differ between the sequencing and Sequenom methods: here Sequenom indicated the same base in both alleles.
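The accounting of assays in this paragraph can be reproduced as a brief calculation. Only the counts stated in the text are used; the variable names are arbitrary.

```python
designed = 139      # assays designed across four plexes
failed_both = 19    # no SNP call in either diploid
failed_one = 22     # call obtained in only one diploid

# Assays that gave a call in both diploid parents.
informative = designed - failed_both - failed_one

confirmed = 92      # SNP call confirmed in genomic DNA
extra_allele = 5    # correct polymorphism plus heterozygosity in one diploid
discordant = 1      # base call differed between methods

accounted_for = confirmed + extra_allele + discordant
confirmed_pct = 100 * confirmed / informative

print(f"{informative} informative assays, {confirmed_pct:.0f}% confirmed")
```

The three outcome classes sum to the 98 informative assays, and 92 of 98 gives the 94% confirmation rate quoted.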

Table 2.   Comparison of single nucleotide polymorphism calls from Illumina read data and Sequenom data in diploid Tragopogon dubius (Td) and T. pratensis (Tp) and allotetraploid T. miscellus (Tm). Td-h = T. dubius homeolog; Tp-h = T. pratensis homeolog
Columns, in order: Contig; Position; Diploid SNP call (Td/Tp), Illumina (cDNA); Diploid SNP call (Td/Tp), Sequenom (gDNA); T. miscellus call, Illumina (cDNA); T. miscellus call, Sequenom (gDNA); Illumina Tm Td-h count; Illumina Tm Tp-h count; Sequenom Tm Td-h area; Sequenom Tm Tp-h area; Sequenom Tm fractional UEP; Sequenom Tm confidence; Putative identity of contig (Arabidopsis thaliana CDS blastn at 1e−10 max)
124541T/CT/CCC0156.7437.170.10ModRieske (2Fe-2S) domain-containing protein
124955T/AT/A A0150.0028.720.00ConRieske (2Fe-2S) domain-containing protein
1351331G/AG/AGAGA211625.2023.190.25ConPorphobilinogen synthase
3031083C/TC/TCTCT222224.7834.080.13ConNonphotochemical quenching 4
938757A/GA/GGG0440.0020.000.55ConHistone H1.2
9922152C/TC/TTCTC131125.5618.200.00Aggalpha-l-Arabinfuranosidase
1054794C/TC/TCTCT111316.6018.500.67ConChaperonin 60 beta
1180260A/GA/GGAGA261615.0754.830.07AggCBL-interacting protein kinase
1290672A/CA/C A0024.330.000.27ConLipase class 3 family protein
1317680G/AG/AGAGA262118.8430.270.00ConRibosomal protein L10 family protein
1537756T/CT/CCTCT393527.5824.600.00ConLipoxygenase 2
1582442A/CA/CCACA142210.2211.400.05ConNo hit
15901089C/GC/G G1560.0056.850.00ConUnknown protein
1734939C/TC/TTCTC251949.2351.440.02ConGlutamate-ammonia ligase
1792510C/TC/TTCTC141714.6311.230.00ModUnknown protein
2014655C/TC/TCTCT201818.1012.150.00AggFlavin reductase-related
2033296T/CT/CCC06911.0541.090.04AggNo hit
2033502A/CA/CCC01349.3533.530.26AggNo hit
22241209A/TA/TTATA181323.5629.550.13ConBinding protein
2342622G/AG/AAA0350.0079.620.33ConIsoflavone reductase, putative
2348489T/AT/AAA0210.6236.540.00ConNo hit
24131284G/AG/AAA02211.8734.760.00AggCytochrome P450 family protein
2413393T/AT/AAA0190.0040.260.17ConCytochrome P450 family protein
2881267A/GA/GGAGA301633.9336.640.00ConWLIM1; transcription factor
3264165G/CG/CCGCG141132.6326.960.04ConIntegral membrane family protein
3354511C/TC/TCTCT262121.2223.830.00ConRemorin family protein
3631199T/CT/CTTC12031.8632.340.01Con40S ribosomal protein S19 (RPS19B)
4514218A/GA/GGG0240.0032.680.59ConMalate dehydrogenase
5824286G/AG/A A033.2737.900.00ConUnknown protein
5824613A/GA/GGG0142.6118.620.61ConUnknown protein
62241515T/CT/CCTCT121416.7214.110.35ConKinase/protein kinase
6494282A/GA/GAGA35014.8716.270.06ConPhosphoserine transaminase
6494747T/GT/G GT6121.2519.000.00ConPhosphoserine transaminase
7252102C/AC/AAA0280.0040.060.27ConUnknown protein
72521131A/CA/C C092.0732.470.33ConUnknown protein
72591241A/GA/GGG0690.0023.220.67ConHistidine kinase 3
73251564T/CT/CTCTC111115.7620.160.28ModTranslation initiation factor
81241348C/TC/TCCT12030.1824.750.01AggDEAD/DEAH box helicase, putative
8429719A/GA/GGAGA152321.1222.340.00ConHydrolase
8583137C/TC/TTT0250.0030.320.56ConRNA recognition motif (RRM)-containing protein
9336788C/TC/TCC22017.910.000.65ConPhosphofructokinase family protein
9560630A/GA/GAGAG181433.8532.280.00ConAspartyl protease family protein
9797600T/AT/AATAT232220.4221.270.00ConSteroid 5-alpha-reductase family protein
12 120154A/GA/GAGAG152320.3025.230.11ConThiamine biosynthesis family protein
20 550301A/GA/GAGAG161028.3133.930.00ConVacuolar-type H(+)-ATPase C3
26 597127G/AG/AAA0860.0068.200.02Con50S ribosomal protein L24, chloroplast
26 640802A/GA/GAGAG311732.4620.700.02ModNo hit
27 6441106G/AG/AAA02716.4562.280.00ConGlucose-6-phosphate isomerase, cytosolic
27 9151673T/CT/C TC18923.3818.840.17ConProtein kinase family protein
27 980907G/AG/AAA0140.0066.870.15ConEndopeptidase
28 066797A/GA/GGAGA243824.8043.410.01ModRieske iron-sulphur protein, putative
28 124122G/AG/AAA0147.6046.180.00ConHeat shock protein 93-V; ATP binding
28 164315T/GT/GGG0132.4629.340.14ConNo hit
28 2371215T/GT/GGTGT504333.2222.900.00ModMethionine adenosyltransferase
28 267771A/GA/GGG0214.9933.950.45Con3-Hydroxybutyrate/phosphogluconate dehydrogenase
28 324262C/TC/TTCTC182119.8516.610.00ModLycopene beta cyclase
28 508963A/GA/GGAGA31159.6222.240.57AggLipoic acid synthase
28 892643C/TC/TCTCT112033.8144.870.03ConLon protease homologue 1, mitochondrial
29 164444T/CT/CTTC2109.058.540.19ConArginine decarboxylase 1
29 903339T/CT/CTCTC44833112.1917.000.25AggPhotosystem 1 subunit K
30 444254C/GC/GCGCG567225.3217.060.00ConAcyl-CoA binding
30 597368T/CT/CTT47031.030.000.17ConShort-hypocotyl 2; transcription factor
30 695913T/AT/AATAT504627.2235.210.00ModBinding/catalytic/coenzyme binding
31 129555T/AT/ATATA172230.9522.470.00ModUPD-d-glucuronate 4-epimerase 6
31 2221366G/AG/AGAGA301927.8121.440.00ModRadical induced cell death 1
31 2371221G/CG/CGCGC131528.2030.100.00ConChitinase
31 552528T/CT/C-TC11261.4550.200.01ModChaperone protein dnaJ-related
31 620939G/AG/AAGAG161520.9122.450.06ConRibosomal protein L3 family protein
31 896271G/CG/CCGCG544233.4423.220.00ModOxidoreductase/protochlorophyllide reductase
31 956513T/CT/CCTCT46038136.0726.330.02ConPhotosystem 2 subunit O-2
32 1351123C/TC/TCTCT132439.7324.360.20ConHydroxymethylglutaryl-CoA lyase
32 924152C/TC/T C17015.880.000.73AggPlastid developmental protein DAG, putative
32 924545C/TC/TCC36018.986.760.65ConPlastid developmental protein DAG, putative
33 0101038T/GT/G G035.8634.870.27AggSMP1 (swellmap 1); nucleic acid binding
33 319126C/TC/TCTCT152511.6516.040.18ConRare cold inducible 2B
33 387336G/TG/TTT0280.0012.870.50ModPlastid-specific 50S ribosomal protein 5
33 552356A/GA/GGAGA585018.7112.620.00ConAspartyl protease family protein
No Sequenom call in T. miscellus
 825880A/GA/G  31012.2826.450.02LowSelenium-binding protein, putative
 437101A/GA/GAG 12178.7427.450.04LowNo hit
 25421098G/AG/AGA 101524.7911.630.00LowUDP-d-apiose/xylose sythase 2
 3134410T/CT/CTC 101214.1629.800.42LowATPase, coupled to transmembrane transport
 4524162C/TC/TCT 242134.3114.820.01LowElectron carrier/protein disulphide oxidoreductase
 11 285699T/CT/CTC 449531.4916.470.00LowUnknown At protein
 27 9151990C/TC/TC 11023.159.310.03LowProtein kinase family protein
 28 122667T/CT/CTC 422252.6218.880.08LowHeat shock protein 93-V; ATP binding
 30 1931092C/TC/TCT 121631.7215.370.30LowSorbitol dehydrogenase, putative
 31 548153A/TA/TAT 253611.045.140.80LowNo hit
 32 692696T/CT/C  17242.308.250.04LowUDP-glucose 6-dehydrogenase, putative
Heterozygosity in diploid revealed by Sequenom
 6461149T/CT/CTTC 281942.4013.930.01LowSucrose–phosphate synthase/transferase
 2071437C/TTC/TTCTC323036.5940.770.00ModMalate dehydrogenase, mitochondrial
 3282690T/CT/CTCTCT171624.1520.720.00ModStructural constituent of ribosome
 8583808T/CCT/C C050.0031.220.32ConRRM-containing protein
 31 924322T/ATA/AT 66010.9531.330.12LowOxidoreductase/protochlorophyllide reductase
Illumina and Sequenom calls differ in diploids
 976919T/CT/TTCT181340.890.000.00ConMethionine aminopeptidase 1B; metalloexopeptidase
Apparent preferential binding of Sequenom primers when both homeologs are present
 6037893C/TC/TCTC553043.617.220.00ModInositol pentakisphosphate 2-kinase
 10 909579G/AG/AGAG101250.1815.110.00AggRNA binding
 28 117519A/GA/GAGA122254.0214.950.08AggHydroxyproline-rich glycoprotein family protein
 31 6371550C/TC/TCTC141254.9111.320.19AggATP synthase, rotational mechanism

We then examined the Sequenom data for the genomic DNA of the T. miscellus plant. Of the 139 SNP assays, 41 did not successfully call any bases within our confidence limits in this plant. In 28 of these 41 cases, the assay also failed to call a SNP in one or both diploid species, but in the remaining 13 cases, the assay called a SNP in both diploid species but not in T. miscellus (see Table 2). In another 13 cases, one or more SNPs were called in T. miscellus, but a base had only been successfully called in one of the diploids (not shown in Table 2). Thus, in total, 85 of the 139 Sequenom assays (61%) provided a call in all three plants. In no case did we find a SNP homeolog present in T. miscellus that had not been found in either T. dubius or T. pratensis at that locus.
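The assay accounting in this paragraph can be retraced with a few lines of arithmetic. The sketch below only restates the counts given in the text; the variable names are ours and are not part of the original analysis.

```python
# Counts as stated in the text; variable names are illustrative only.
total_assays = 139
no_call_miscellus = 41        # no base called in the T. miscellus plant
also_failed_in_diploid = 28   # of those, also failed in one or both diploids
one_diploid_missing = 13      # called in T. miscellus, but in only one diploid

# Assays that failed in T. miscellus despite working in both diploids.
miscellus_only_failure = no_call_miscellus - also_failed_in_diploid

# Assays that gave a call in all three plants.
working = total_assays - no_call_miscellus - one_diploid_missing

print(miscellus_only_failure)                    # 13
print(working, f"{working / total_assays:.0%}")  # 85 61%
```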

Sequenom data from only 81 assays were used to infer homeolog loss in T. miscellus. Of the 85 assays that worked in all three plants, three were excluded because of heterozygosity in T. dubius and a fourth because of an identical call in both diploids. Of the 81 assays used, 47 gave evidence in T. miscellus of both T. dubius and T. pratensis SNP homeologs, and 34 gave evidence of only one SNP homeolog. Thus, 41% of the SNP loci gave evidence for homeolog loss. Of these 34 loci, 25 (74%) showed loss of the T. dubius homeolog, and nine (26%) showed loss of the T. pratensis homeolog. If we increase stringency by omitting ‘aggressive’ calls (i.e. less confident Sequenom calls), we find that 69 assays gave a call; of these, 43 gave evidence of both SNP homeologs in T. miscellus and 26 gave evidence of only one SNP homeolog.

We then compared the Sequenom and Illumina sequence data for the T. miscellus plant, to discover how often Illumina expression data had successfully identified a candidate for genomic loss (Table 3). Illumina read counts agreed with the Sequenom calls in 89% of the cases where depth of coverage was above 10× per SNP homeolog, counting Sequenom calls at the conservative, moderate and aggressive levels of confidence (listed in descending order of confidence). Where Illumina read data had predicted ‘potential gene loss’, this was confirmed by Sequenom analysis in 23 of 27 cases (85%). In the other four cases, both homeologs were detected in the genomic DNA by Sequenom but only one in the transcriptome by Illumina (i.e. they were scored as ‘potential gene loss’). This may be due to homeolog silencing. In contrast, four SNP homeologs were detected in the transcriptome by Illumina (i.e. they were scored ‘homeologs balanced’) but were not found in the genome by Sequenom. Manual examination of the mass spectrometer traces for these calls suggested that three of them, which had all been called at the ‘aggressive’ (lowest) level of confidence, did in fact have both homeologs present in the gDNA. In no case did Illumina show no expression of one homeolog while Sequenom showed loss of the other homeolog.

Table 3.   Comparison of Sequenom genomic DNA calls and Illumina cDNA reads for SNP loci, to assess usefulness of Illumina transcriptome profiling for identifying candidate genes for homeolog loss

| Illumina read depth score | Sequenom, conservative, moderate and aggressive calls: one homeolog called | Sequenom, conservative, moderate and aggressive calls: both homeologs called | Sequenom, conservative and moderate calls: one homeolog called | Sequenom, conservative and moderate calls: both homeologs called |
| --- | --- | --- | --- | --- |
| Potential gene loss | 23 | 4 | 20 | 3 |
| Homeologs balanced | 4 | 42 | 1 | 37 |
| <10× coverage | 7 | 3 | 5 | 3 |
| Total | 34 | 49 | 26 | 43 |
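The agreement between Illumina and Sequenom quoted in the text can be recomputed from the counts in Table 3. The following is a minimal sketch: the dictionary layout and the function name `concordance` are our own assumptions, and loci with <10× coverage are left out, as in the text.

```python
# Counts transcribed from Table 3, as (one homeolog called, both homeologs called)
# by Sequenom, for each Illumina read-depth score. Layout is illustrative only.
TABLE3 = {
    "all_calls": {          # conservative, moderate and aggressive calls
        "potential_gene_loss": (23, 4),
        "homeologs_balanced": (4, 42),
    },
    "no_aggressive": {      # conservative and moderate calls only
        "potential_gene_loss": (20, 3),
        "homeologs_balanced": (1, 37),
    },
}


def concordance(counts):
    """Fraction of loci at which the Illumina score and the Sequenom call agree."""
    loss_one, loss_both = counts["potential_gene_loss"]
    bal_one, bal_both = counts["homeologs_balanced"]
    # Agreement: predicted loss with one homeolog called, or balanced with both called.
    agree = loss_one + bal_both
    total = loss_one + loss_both + bal_one + bal_both
    return agree / total


for label, counts in TABLE3.items():
    print(f"{label}: {concordance(counts):.0%}")
```

With these counts the two fractions are 65/73 and 57/61, which round to 89% and 93%.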

Discussion

Genomic resources are scarce for many organisms that are studied in a natural ecological or evolutionary context (Ellegren 2008; Hudson 2008). Here, we demonstrate a protocol that uses next-generation technologies to rapidly develop SNP markers in many hundreds of genes in a species which is a good evolutionary model but which until now has not been a genetic model organism. Such SNP markers have many potential uses (e.g. Cannon et al. 2010, Renaut et al. 2010, Van Bers et al. 2010, Whittall et al. 2010; all this issue). We have used them to distinguish between homeologous genes in the recent natural allopolyploid T. miscellus. Using transcriptome profiling and Sequenom genotyping, we have detected many cases of gene loss. Below we discuss the biological implications of our findings in T. miscellus and the general utility of the methods described in this paper.

Biological implications of findings in T. miscellus

This paper provides the first large-scale analysis of homeologous gene loss in a recent (∼40-generation-old) natural allopolyploid. In a single T. miscellus individual, we found 254 cases of putative homeolog loss or silencing by transcriptome profiling with Illumina sequencing (8.5% of the 2989 SNPs profiled). Sequenom analysis confirmed that in a sample of 27 of these SNPs, 23 (85%) were cases of genomic homeolog loss. The remaining 15% are likely to be homeologs that are present in the genome but were not being expressed at the time of sampling in the leaf tissue subject to transcriptome analysis. Homeolog loss therefore appears to be more common than homeolog silencing (i.e. lack of expression of a gene found in the genome) in this species.

We found preferential loss of T. dubius homeologs over T. pratensis homeologs in the allopolyploid T. miscellus in this study. Illumina read data on the transcriptome suggested loss or silencing of the T. dubius homeolog in 164 of 254 SNPs (65%) showing homeolog loss or silencing, and Sequenom analysis of the genome suggested loss of the T. dubius homeolog in 25 of 34 SNPs (74%) showing homeolog loss. In earlier studies, a similar bias was found: combined results from Buggs et al. (2009) and Tate et al. (2006, 2009a) gave 56 T. dubius homeolog losses and 27 T. pratensis homeolog losses across multiple populations. Interestingly, we also found a bias in gene expression in our Illumina read data, with T. dubius homeologs expressed more than T. pratensis homeologs in 77% of the SNPs where we detected differential expression. Because T. dubius ESTs were used as the reference sequence, we might expect a bias towards the alignment of Illumina reads derived from T. dubius homeologs. This possible bias may have contributed to the apparent higher expression of the T. dubius homeolog at many loci. However, because it would favour the detection of T. dubius reads, it also suggests that the finding of a higher rate of loss of T. dubius homeologs is a robust result.
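One way to gauge whether such counts depart from an even split between the two parental homeologs is an exact two-sided binomial test. The sketch below is purely illustrative and is not an analysis reported in this study; the function name is our own, and the test assumes that loci are independent, which may not hold where several SNPs fall in the same contig.

```python
from math import comb


def two_sided_binomial_p(k, n):
    """Exact two-sided binomial p-value for k successes in n trials with p = 0.5."""
    probs = [comb(n, i) / 2**n for i in range(n + 1)]
    observed = probs[k]
    # Sum the probabilities of all outcomes at least as unlikely as the observed one
    # (the small tolerance guards against floating-point ties).
    return sum(p for p in probs if p <= observed * (1 + 1e-9))


# Sequenom: 25 T. dubius losses among 34 loci showing loss of one homeolog.
p_sequenom = two_sided_binomial_p(25, 34)
# Illumina: 164 of 254 SNPs with the T. dubius homeolog lost or silenced.
p_illumina = two_sided_binomial_p(164, 254)

print(f"Sequenom 25/34: p = {p_sequenom:.4f}")
print(f"Illumina 164/254: p = {p_illumina:.2e}")
```

Under these assumptions both counts are unlikely under an even split, but the Illumina figure in particular should be read with caution because of the reference bias discussed above.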

It is notable that a similar bias towards loss of T. dubius genetic material and higher expression of T. dubius genes has been found for rDNA in both T. miscellus and T. mirus, an allopolyploid that has T. dubius as the paternal parent and T. porrifolius as the maternal parent (Kovarik et al. 2005). In both species, concerted evolution has reduced the copy numbers of rDNA units derived mainly from the T. dubius diploid parent but, paradoxically, repeats of T. dubius origin dominate transcription in most populations studied (Matyasek et al. 2007). Tragopogon mirus also shows a bias towards loss of T. dubius homeologs using CAPS markers (Koh et al., in press).

What causes the bias towards higher rates of gene loss and increased expression of T. dubius homeologs? One possibility might be maternal effects as a result of cytoplasmic-nuclear interactions. The T. miscellus plant in the current study, as well as all T. mirus plants and the majority of T. miscellus plants included in other studies, has T. dubius as the paternal parent. Perhaps selection favours maintaining ancestral similarity in the cytoplasmic and nuclear genomes. Another explanation might be the higher genetic variability of T. dubius populations (Soltis et al. 1995); it is possible that the T. dubius individual that we examined from Oakesdale was not genetically identical to the actual T. dubius progenitor of T. miscellus from Oakesdale. However, it seems unlikely that the bias is because of the selection of an inappropriate T. dubius genotype in this study as the other studies cited above as showing the same pattern have examined multiple T. dubius individuals. Our results also agree with those found in other species. In synthetic allopolyploids of Brassica, genomic changes occur more often in the paternal genome (Song et al. 1995). In natural Gossypium hirsutum (Flagel et al. 2008) and synthetic Arabidopsis allopolyploids (Wang et al. 2006), homeolog expression biases also tend to be in favour of the paternal genome. In maize, it has recently been shown that paternal genomic imprinting influences gene expression patterns in hybrids (Swanson-Wagner et al. 2009).

One mechanism by which homeolog loss may occur in T. miscellus is homeologous recombination, in which fragments of chromosomes can be lost. Ownbey (1950) observed multivalent formation in early generations of natural T. miscellus, and rare patterns of isozyme variation in T. miscellus are consistent with homeologous recombination (Soltis et al. 1995). More recently, Lim et al. (2008) and Tate et al. (2009b) reported multivalent formation in both natural and synthetic Tragopogon allopolyploids, along with monosomy, trisomy and reciprocal translocations in natural Tragopogon allopolyploids. Homeologous recombination appears to have caused loss of chromosome fragments in resynthesized Brassica allopolyploids (Song et al. 1995; Gaeta et al. 2007). Another possible mechanism of homeolog loss is gene conversion, as has been found for rRNA genes in both T. miscellus and T. mirus (Kovarik et al. 2005; Matyasek et al. 2007).

High-throughput SNP discovery together with the genotyping of many natural T. miscellus plants of independent origin and F1 hybrids will enable us to examine genome-wide patterns of homeolog loss in this species. As SNPs are abundant in many species and easily detected (Gut 2001; Kwok 2001), they are excellent genetic markers for the generation of dense genetic maps that can support marker-assisted selection and association genetics programs, as well as inform on genome organization and function (Pavy et al. 2008; Slate et al. 2009). In T. miscellus, application of these markers will enable us to understand further the causes of homeolog loss in this allopolyploid, showing us whether or not homeolog losses occur in linkage groups – implying the loss of large fragments of chromosomes – or in small fragments scattered throughout the genome.

Utility of methods

In the space of a few months, we have been able to identify at high stringency 7782 homeolog-specific SNP markers within 2885 unique contigs in T. miscellus using next-generation sequencing. The number of homeologous genes available for study has therefore been increased by two orders of magnitude compared with previous studies using a ‘one gene at a time’ approach (Tate et al. 2006, 2009a; Buggs et al. 2009). The number of actual SNPs discovered is likely to be much higher than this, as we were likely over-stringent. We have developed working assays for 85 of these SNPs using Sequenom MassARRAY iPLEX technology. This high-throughput approach transforms our ability to study molecular evolution in T. miscellus.

The use of transcriptome sequencing with polyA purification is valuable for targeting functional genes for SNP discovery, as clearly shown by this study. However, there is the possibility that when these markers are then used to study the genome, polymorphisms will be discovered because of the presence of silent homeologs. In a few cases, we found this: six of 139 Sequenom SNP assays found polymorphisms in genomic DNA of diploid plants that had not been detected by Illumina sequencing in the transcriptome. This was an acceptably low level of polymorphism that was undiscovered by transcriptome sequencing. However, it should be noted that T. dubius and T. pratensis are mostly selfing species (Cook & Soltis 1999, 2000) with limited polymorphism in their introduced ranges in North America (Soltis et al. 1995; Symonds et al. 2009). Outcrossing species with high heterozygosity may pose more difficulties in analysis.

Sequenom MassARRAY Typer 4.0 Analyzer software uses a three-parameter model to calculate the significance of each putative genotype. This compares the size of peaks for the possible bases at each SNP site and the peak for the unextended primer. Where an assay is not working well, the nonextended primer will be found in greater abundance than the extended oligonucleotides. For genotypes which are called, the degree of confidence that can be placed on the call is described as ‘conservative’, ‘moderate’ or ‘aggressive’ in the software output. We found that four calls [three called at the ‘aggressive’ (lowest) level of confidence and one at the ‘moderate’ level] were not reliable because of failure to detect a base that was in fact present (a false negative). Manual examination of the mass-spectrometer trace in most cases allowed the call to be corrected.

This ‘false negative’ problem is likely to be due to the malfunction of these specific assays, rather than to the unreliability of ‘aggressive’ calls in general. Certain assays can function well in calling different bases in homozygotes, but in a heterozygote the primers bind preferentially to one allele, resulting in a false homozygote call. One reason why this occurs is the presence of another SNP close to the SNP site that is being assayed (Liu et al. 2009). Preferential binding of primers can be assessed by genotyping more individuals that are expected to be heterozygous. If they all appear to be homozygous, then the Sequenom assay for that SNP should be rejected. We did this (see below) and found that these assays did not work correctly in multiple individuals. In addition, if we discard all aggressive Sequenom calls, we find that the correspondence between the Illumina and Sequenom data rises only slightly, from 89 to 93%. This also suggests that there is not a general problem with the reliability of ‘aggressive’ calls.
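The screening rule described above (reject an assay if individuals expected to be heterozygous all appear homozygous) can be sketched as follows. This is a minimal illustration, not the procedure used in this study: the assay identifiers, the genotype encoding, the example calls and the `min_samples` threshold are all hypothetical.

```python
def flag_preferential_binding(calls_by_assay, min_samples=3):
    """Return assay IDs whose expected heterozygotes all look homozygous.

    calls_by_assay maps a (hypothetical) assay ID to the genotype calls obtained
    from individuals expected to carry both bases, e.g. "CT" for a heterozygote
    call, "C" for a homozygote call and "" for a failed call.
    min_samples is an assumed threshold, not a value taken from the study.
    """
    flagged = []
    for assay_id, calls in calls_by_assay.items():
        called = [c for c in calls if c]   # ignore failed (empty) calls
        if len(called) < min_samples:
            continue                       # too few calls to judge this assay
        if all(len(set(c)) == 1 for c in called):
            flagged.append(assay_id)       # every call shows a single base
    return flagged


# Made-up calls for illustration only; these are not data from this study.
example = {
    "assay_A": ["C", "C", "C", "C"],      # always homozygous: suspect
    "assay_B": ["CT", "CT", "C", "CT"],   # heterozygotes detected: keep
    "assay_C": ["G", ""],                 # too few calls to decide
}
print(flag_preferential_binding(example))  # ['assay_A']
```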

This study also demonstrates that transcriptome profiling using Illumina sequencing is a useful method for identifying candidate homeologs for the study of homeolog loss in an allopolyploid species. This allows us to target these genes for developing SNP-typing assays, saving both time and money. The major cost in using Sequenom genotyping is the production of primers. Each SNP requires three primers: two for an initial amplification of the target region and one for the SNP-typing reaction. Once these primers have been synthesized, many samples can be SNP-typed at relatively low cost. We made use of this fact by screening an additional 94 individuals: a total of 87 diploid and T. miscellus plants from five natural populations, two 50-year-old herbarium specimens and five artificial crosses. Preliminary analyses of this survey allowed us to identify polymorphisms in the diploid plants and calculate allelic diversity. This data set showed repeatability of some homeolog losses in natural T. miscellus populations of different origins. Finally, we also found the first evidence for rare loss of alleles in F1 hybrids between T. dubius and T. pratensis. Robust analysis of this data set is ongoing.

Broader applicability

Transcriptome sequencing by 454 has many potential applications in ecology (Ellegren 2008; Wang et al. 2009; Cannon et al. 2010; Elmer et al. 2010; Renaut et al. 2010; Van Bers et al. 2010; Whittall et al. 2010; Wolf et al. 2010). It has been used for the de novo characterization of the transcriptomes of the Glanville fritillary butterfly (Vera et al. 2008) and of Eucalyptus grandis (Novaes et al. 2008). Recent work in model organisms has used short-read sequencing to study differences in expression of SNP-containing alleles, for example in micro-RNAs in mice (Kim & Bartel 2009). Sequenom MassARRAY genotyping has been used to study allelic expression in hybrid maize (Stupar & Springer 2006) and levels of homeolog expression in allopolyploid cotton (Flagel et al. 2008, 2009; Chaudhary et al. 2009). This study demonstrates the effectiveness of a hybrid Illumina and 454 sequencing approach, combined with Sequenom MassARRAY iPLEX genotyping, in dramatically increasing our ability to study the evolution of duplicated genes in natural allopolyploids such as T. miscellus. These methods could be applied to any organism, allowing efficient and cost-effective generation of SNP markers.

Acknowledgements

We thank William Farmerie and Regina Shaw of the Interdisciplinary Center for Biotechnology Research at the University of Florida for 454 sequencing and Michael Sandford for technical support. This work was supported by the University of Florida, and NSF grants MCB-034637, DEB-0614421, DEB-0919254 and DEB-0919348.

Conflicts of interest

The authors have no conflict of interest to declare and note that the funders of this research had no role in the study design, data collection and analysis, decision to publish or preparation of the manuscript.
