Cynara cardunculus (2n = 2× = 34) is a member of the Asteraceae family that contributes significantly to the agricultural economy of the Mediterranean basin. The species includes two cultivated varieties, globe artichoke and cardoon, which are grown mainly for food. Cynara cardunculus is an orphan crop species whose genome/transcriptome has been relatively unexplored, especially in comparison to other Asteraceae crops. Hence, there is a significant need to improve its genomic resources through the identification of novel genes and sequence-based markers, to design new breeding schemes aimed at increasing quality and crop productivity. We report the outcome of cDNA sequencing and assembly for eleven accessions of C. cardunculus. Sequencing of three mapping parental genotypes using Roche 454-Titanium technology generated 1.7 × 10^6 reads, which were assembled into 38 726 reference transcripts covering 32 Mbp. Putative enzyme-encoding genes were annotated using the KEGG database. Transcription factors and candidate resistance genes were surveyed as well. Paired-end sequencing was done for cDNA libraries of eight other representative C. cardunculus accessions on an Illumina Genome Analyzer IIx, generating 46 × 10^6 reads. Alignment of the IGA and 454 reads to reference transcripts led to the identification of 195 400 SNPs with a Bayesian probability exceeding 95%; a validation rate of 90% was obtained by Sanger sequencing of a subset of contigs. These results demonstrate that the integration of data from different NGS platforms enables large-scale transcriptome characterization, along with massive SNP discovery. This information will contribute to the dissection of key agricultural traits in C. cardunculus and facilitate the implementation of marker-assisted selection programs.
Unravelling the genetic basis of phenotypic traits is helpful for the design of breeding strategies directed at improved yield, biomass production and end-use quality. The current C. cardunculus genetic map (Portis et al., 2009) has been expanded significantly since the publication of the earliest AFLP-based version (Lanteri et al., 2006). A major step forward was taken by the generation of a large set of microsatellite (SSR) assays mined from EST sequences (Scaglione et al., 2009), part of which were used to construct SSR-based consensus genetic maps (Portis et al., 2012; Sonnante et al., 2011). The rapid development of sequencing technology has promoted the use of SNP (single nucleotide polymorphism) markers in large genotyping experiments as they are virtually unlimited (Ganal et al., 2009) and a wide range of high-throughput analysis platforms is available (Ragoussis, 2009). Sequence-based molecular markers are essential for performing comparative analysis across related species and providing anchoring features for scaffold ordering in genome sequencing projects. Next generation sequencing (NGS) technology, along with the necessary bioinformatics support, is designed to rapidly acquire very large amounts of sequence data (Delseny et al., 2010), and is thus well-suited for SNP discovery (Imelfort et al., 2009; Metzker, 2010). Among the NGS platforms suitable for this purpose are the 454 FLX-Titanium pyrosequencing, which generates relatively long reads that can facilitate accurate de novo assemblies (Bundock et al., 2009), and the Illumina GAIIx platform, which is highly cost-effective and thus is particularly suitable for re-sequencing and SNP discovery (Trick et al., 2009).
Sequencing the expressed portion of the genome is generally simpler than attempting whole-genome sequencing because of its lower complexity (Kaur et al., 2011). In orphan crop species, the cost of a whole-genome shotgun approach can be prohibitive, depending on its genome size and level of complexity (Parchman et al., 2010). Recently, a reduced genome complexity approach based on sequencing of RAD-tags (Scaglione et al., 2012) has been conducted on the C. cardunculus genome to generate a large set of genomic SNPs. Indeed, transcriptome deep sequencing together with extensive gene annotation represents a cost-effective and valuable resource for genetic, genomic and proteomic investigations (Brautigam and Gowik, 2010).
Herein, we report the sequencing and assembly of the C. cardunculus transcriptome and its annotation, including the identification of transposable element (TE) signatures and putative miRNA targets. In addition, we have undertaken a broad-based program of SNP discovery based on sequencing of three mapping population parents and eight other representative accessions of the three C. cardunculus taxa.
Roche 454-based cDNA sequencing was performed on three mapping parents: var. scolymus ‘Romanesco C3’, var. altilis ‘Altilis 41’ and var. sylvestris ‘Creta 4’ (Table 1, Figure 1). This produced 700 Mbp of raw sequence from approximately 1.7 × 10^6 reads (435 375 from ‘Romanesco C3’, 610 622 from ‘Altilis 41’ and 696 300 from ‘Creta 4’) (Table 2). Post-sequencing filtering reduced the total by only approximately 1%, resulting in 692.2 Mbp with a mean read length of 392 bp (Table 2). Quality trimming did not lead to a significant further reduction in data (data not shown). cDNA libraries from the remaining eight accessions (Table 1) were sequenced using an Illumina GAIIx platform, producing 6.9 Gbp of raw data (46.4 million paired-end reads) with a mean of 5.8 × 10^6 reads per accession. The data set was reduced to 6.7 Gbp following the removal of adaptor sequences and other contaminants, and it was further reduced to 6.2 Gbp after quality trimming. Consistently across samples, forward and reverse reads had final average lengths of 72.5 and 65.8 bp, respectively. Although aggressive trimming of sequences led to a sizable reduction in bases at the 3′-ends, a common problem in Illumina-derived reads (Metzker, 2010), this measure substantially reduced the risk of false SNP calls. The relative representation and quality of the multiplexed samples in each sequencing lane were evenly distributed (Table 3).
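As an illustration of the 3′-end quality trimming described above, the following minimal sketch removes trailing low-quality bases from a single read. It is not the trimming tool used in this study; the function name `trim_3prime` and the Phred cutoff of 20 are assumptions chosen for the example, and the read is invented.

```python
def trim_3prime(seq, quals, min_q=20):
    """Trim low-quality bases from the 3' end of a read.

    seq   -- read sequence
    quals -- list of Phred scores, one per base
    min_q -- hypothetical cutoff (the value used in the study is not stated)
    """
    end = len(seq)
    # Walk back from the 3' end while bases fall below the cutoff.
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


# Invented read whose quality decays towards the 3' end.
read = "ACGTACGTAC"
phred = [38, 37, 36, 35, 30, 28, 25, 18, 12, 5]
trimmed_seq, trimmed_quals = trim_3prime(read, phred)
print(trimmed_seq, len(trimmed_seq))
```

Trimming of this kind shortens reads, which is consistent with the shorter mean length reported for the reverse reads, but it removes the positions most likely to produce spurious variant calls.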
Table 1. The Cynara cardunculus germplasm panel used for sequencing
Column heading recovered: Cynara cardunculus taxon. F, mapping parent; C, obtained from seed retailers/developers; G, germplasm held at DIVAPRA, Plant genetics and breeding sector.
Accessions recoverable from the table body: ‘Violetto di Chioggia’, ‘Blanco de Peralta’, ‘Gobbo di Nizza’.
Table 2. 454-derived sequencing and assembly. The output statistics were calculated following the removal of contaminating and adaptor sequences. ‘Romanesco C3’ is a globe artichoke, ‘Altilis 41’ a cultivated cardoon and ‘Creta 4’ a wild cardoon
*Results obtained by merging the three independent assemblies (Figure 1).
Recoverable entries: total reads (sum of the three accessions), 1 742 297; Phred20 bases (%), 167 Mb (90), 216 Mb (88) and 227 Mb (87); No. of contigs (values not recovered).
Table 3. GAIIx (Illumina)-derived sequencing. A total of 46.6 × 10^6 raw reads were generated in two GAIIx lanes and 6.7 Gbp were retained after adaptor and contaminating sequences were removed
Columns: number of raw reads (× 10^6); first mates and paired mates (Mbp) before quality clipping; first mates and paired mates (Mbp) after quality clipping; avg. data loss (%).
Accessions recoverable from the table body: ‘Violetto di Chioggia’, ‘Blanco de Peralta’, ‘Gobbo di Nizza’ (values not recovered).
454 de novo assembly
The first phase of the 454 data assembly generated 37 622 contigs for ‘Romanesco C3’, 40 130 contigs for ‘Altilis 41’ and 42 837 contigs for ‘Creta 4’, with N50 contig lengths of 834, 761 and 772 bp and mean coverage levels of 7.31X, 8.45X and 9.17X, respectively. For the ‘Romanesco C3’ assembly, a subset of 11 276 contigs resulted from the incorporation of a prior set of 28 641 Sanger ESTs (http://www.ncbi.nlm.nih.gov/dbEST). In the second phase, after contaminant removal by BLASTX analysis, the three datasets were merged into a set of 38 726 contigs. This ‘reference’ assembly spanned 32.7 Mbp and had a GC content of 42.1%. The mean contig length was 844.3 bp (N50: 951 bp) (Figure 2), which represents a 118% improvement in the coverage of the transcriptome from that described by Scaglione et al. (2009). As a result of the second assembly phase, 20 469 contigs were generated by merging at least two taxon-derived contigs from the first phase, forming a subset with a mean length of 1054 bp, while 5375, 6669 and 6213 remained as single taxon-derived contigs of var. scolymus, var. altilis and var. sylvestris, respectively (Data S1).
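The N50 values quoted above summarize contig length distributions. The sketch below shows how an N50 can be computed from a list of contig lengths and checks the reported mean contig length against the assembly span and contig count. The function name `n50` and the toy contig lengths are assumptions for illustration; only the 32.7 Mbp span and the 38 726 contigs come from the text.

```python
def n50(lengths):
    """Return the N50: the length L such that contigs of length >= L
    together contain at least half of all assembled bases."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("empty contig list")
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length


# Toy contig lengths (invented, not taken from the assembly).
toy_contigs = [2000, 1500, 1000, 800, 500, 200]
print(n50(toy_contigs))

# Mean contig length implied by the reported span and contig count.
mean_len = 32.7e6 / 38726
print(round(mean_len, 1))
```

Because the span is rounded to 32.7 Mbp, the mean computed this way agrees with the reported 844.3 bp only to within a fraction of a base pair.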
Open reading frame (ORF) and intron prediction. ORFs were associated with 38 567 sequences (99.6% of the total), defining 5.02 Mbp of 5′-UTR sequence, 21.17 Mbp of CDS and 6.42 Mbp of 3′-UTR sequence. BLASTX alignment succeeded in assigning ORFs to 587 contigs where ab initio prediction failed. The start codons were validated in 19 198 contigs via comparative analysis by BLASTX, whereas another 2037 start codons were labelled as putative. Based on the predicted orthologs or out-paralogs from Arabidopsis thaliana, and assuming the general conservation of exon/intron structure (Fedorov et al., 2002; Rogozin et al., 2003), it was possible to infer the positions of 100 102 putative exon-exon junctions in 22 764 of the contigs.
Multiple approaches were employed to estimate transcriptome representation and gene-level redundancy (e.g. splicing variants). Modelling our experiment on A. thaliana gene content, the 454 sequencing data were predicted to generate a total assembly length of 29.3 Mbp, distributed among 24 064 unigenes with an average length of 1216 bp and an overall coverage of 96% of the transcriptome. Simulation analyses of each of the three independent taxon-derived assemblies (Der et al., 2011) resulted in similar asymptotic trends, with end-point values of 31 439, 32 766 and 31 869 unigenes for ‘Romanesco C3’, ‘Altilis 41’ and ‘Creta 4’, respectively (Figure 3). To corroborate these results, we clustered the final contig set (38 726 contigs) using the same criteria, aimed at collapsing gene variants (e.g. alternative splicing), in a relaxed CAP3-based assembly procedure. This resulted in a set of 29 830 unigenes representing a bona fide estimate of the gene content of C. cardunculus, which suggested that 23% of the 38 726 assembled transcripts were redundant owing to splicing variants.
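The redundancy estimate follows directly from the two counts given above. As a quick arithmetic check (variable names are ours):

```python
contigs = 38726    # assembled reference transcripts
unigenes = 29830   # clusters after the relaxed CAP3-based step

# Fraction of transcripts collapsed into an existing unigene.
redundant_fraction = (contigs - unigenes) / contigs
print(f"{redundant_fraction:.1%}")
```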
The assignment of C. cardunculus genes to the chloroplast genome was based on similarity to those of lettuce and sunflower (Timme et al., 2007); this led to the categorization of 137 contigs. Of 84 annotated sunflower chloroplast genes, 80 were also present in C. cardunculus. Similarly, the grapevine (Vitis vinifera) mitochondrial genome (Goremykin et al., 2009) was used to identify putative transcripts of C. cardunculus mitochondria; of the 74 grape genes, we identified 52 contigs with high sequence similarity. Detailed information regarding non-nuclear transcripts, their representation and redundancy is provided in Data S2.
The automated BLASTX analysis produced 711 220 hits against the NCBI non-redundant protein database, allowing for the retrieval of >1.3 million GO terms from several databases (80% from UniProtKB/TrEMBL). The Blast2GO pipeline successfully annotated 32 408 C. cardunculus transcripts, with 4399 falling below the threshold score and 1919 remaining unassociated with any GO term. Finally, a total of 184 469 GO terms were assigned, with an average of five per sequence. Conserved domains were identified by InterProScan (IPS, http://www.ebi.ac.uk/interpro), resulting in codes for 25 485 sequences and an additional 14 720 GO terms for the final annotation. A comprehensive table showing the full annotation is given in Data S3.
Transcription factor activity was assigned to 1398 transcripts distributed among 67 families. Over half of the transcripts involved in the regulation of transcription belonged to one of the bHLH, MYB, WRKY, C2H2, NAC, AP2-EREBP, bZIP, MYB-related or C3H families (Table 4, Data S4).
Table 4. Categorization of transcription factors (TF). These activities were identified using a set of relevant GO terms, and confirmed by a BLASTX search of the Plant Transcription Factor Database (PlnTFDB)
Avg. identity %
Across a set of 12 449 transcripts, 16 419 enzyme codes were retrieved from the Blast2GO database and mapped onto KEGG’s metabolic pathway encyclopedia (http://www.genome.jp/kegg/). The representative sample of C. cardunculus enzymes consisted of 1133 unique enzyme codes. A summary and separate map images for each pathway are provided in Data S5. A subset of 71 enzymatic activities known to be involved in phenylpropanoid synthesis was identified among 921 sequences; 21 of these were annotated at varying levels of redundancy in the core phenylpropanoid pathway (KEGG’s map: 00940), in which the synthesis of caffeoylquinic and di-caffeoylquinic acids (CQAs and dCQAs) takes place (Table 5). Pathogen resistance gene homologs were represented by 1860 transcripts. After validation by a PFAM search, 316 were retained on the basis of 214 matches with leucine-rich repeats, 79 matches with TIR motifs and 52 matches with NB-ARC motifs; 23 of the sequences carried both a TIR and an NB-ARC motif (Data S6).
Table 5. List of the enzymes involved in the phenylpropanoid synthesis pathway upstream of the CQAs and dCQAs. The upper section shows enzymes identified by automatic annotation, the lower section lists the best candidate based on homology searches
No. of annotated contigs
Automated annotation pipeline
caffeic acid 3-O-methyltransferase
Relict TE sequence in the transcriptome
When the Viridiplantae RepBase collection was interrogated, the search criteria identified 371 occurrences of TE relict sequence in the transcriptome (Data S7). There were discernible differences between the translated and non-translated regions with respect to both the identity of the TE family involved and the mean length of relict sequence present (Figure 4). LTR copia-like elements were most frequent (103), with a mean sequence length of 151 bp. DNA/Helitron and RC/Helitron sequences had comparable mean lengths of 142 and 156 bp, respectively. DNA/En-Spm and LTR/Gypsy-like relicts were shorter, with mean lengths of 95 and 108 bp, respectively. The copia-like elements tended to leave larger insertions in the 5′-UTR (mean 252 bp) than in either the CDS (103 bp) or the 3′-UTR (98 bp). No such distribution pattern was evident for either the DNA/En-Spm or the Gypsy-like elements. A manual inspection of read overlaps was carried out for regions reporting the presence of TE signatures. We ensured that either 454 or Illumina reads spanned across and beyond the repeated sequence during assembly, confirming that mis-assemblies had not occurred, as expected given that an overlap-layout-consensus assembler was adopted.
Presumptive miRNA targets
Each sequence was scanned for the presence of recognition sites for known plant miRNAs. In total, target annealing sites for 302 miRNAs were located in 1043 transcripts (Table 6). miR414 was removed from the dataset because of its low level of conservation in genomes other than A. thaliana and rice and its poor precursor hairpin structure (miRBase, release 16). A Fisher’s exact test indicated some GO enrichment for miRNA targets (Table 7), particularly for the categories ‘immune system/defense response’ and ‘programmed cell death/apoptosis’, followed by ‘reproduction; development of anatomical structure’, ‘photosynthesis’, ‘trans-membrane receptor activity’ and ‘transcription factor activity’. A total of 40 sequences belonged to both the ‘programmed cell death/apoptosis’ and ‘immune system’-related paths of the DAG (directed acyclic graph); half of these were related to miR2109, a soybean miRNA (Wang et al., 2009) predicted to target 22 sequences across the whole C. cardunculus dataset. miR2109 was the third most abundant target site after miR395 (26 transcripts) and miR2275 (29 transcripts). A complete classification is given in Data S8.
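To make the enrichment test concrete, the following sketch computes a one-sided Fisher’s exact test p-value for the over-representation of a single GO term among putative miRNA targets. It is a minimal stand-alone illustration, not the Blast2GO implementation used here; the function name `fisher_enrichment_p` and the counts in the example are invented, and the FDR correction applied across GO terms is not shown.

```python
from math import comb


def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher's exact test for over-representation.

    k -- miRNA-target transcripts annotated with the GO term
    n -- all miRNA-target transcripts (test set)
    K -- transcripts annotated with the GO term in the whole set
    N -- size of the whole annotated set
    Returns P(X >= k) under the hypergeometric null distribution.
    """
    upper = min(n, K)
    total = comb(N, n)
    # Sum the probabilities of observing k or more annotated targets.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Invented toy counts: 3 of 5 targets carry a term found in 5 of 20 transcripts.
p = fisher_enrichment_p(k=3, n=5, K=5, N=20)
print(round(p, 4))
```

In a real analysis, one p-value of this kind would be computed per GO term and then adjusted for multiple testing before applying a threshold such as the FDR < 0.01 quoted in Table 7.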
Table 6. Abundance of putative miRNA annealing sites in the Cynara cardunculus transcriptome. miRNA families occurring fewer than five times were incorporated into the category labelled ‘other’. A complete report is provided in Data S8
No. of targets
Table 7. The over-representation of GO terms among the putative miRNA target transcripts. Fisher’s exact test (FDR <0.01) was used to assess statistical significance
No. of test
No. of references
Programmed cell death
Transmembrane receptor activity
Endoplasmic reticulum lumen
Innate immune response
Inositol hexakisphosphate binding
Immune system process
Response to singlet oxygen
5S rRNA binding
Anatomical structure development
Transcription factor activity
Read mapping and SNP calling
About 1.5 × 10^6 of the 454-derived reads were aligned to the reference contig set (38 726 contigs). This number was reduced to approximately 1.0 × 10^6 reads by removing those that showed more than one unique alignment, thereby lowering the risk of false SNP calls caused by misalignment of paralog-derived reads or by redundancy resulting from splicing variants. The same procedure was repeated for the Illumina-derived reads, producing an alignment of approximately 60 × 10^6 paired-end reads. Resolving paired ends reduced this to a set of approximately 21 × 10^6 reads. An assembly based on >35 × 10^6 read sequences was generated by merging the two sequence datasets, resulting in a median genome coverage of 96X with 26 990 reference contigs containing at least 20 mapped reads. Paired-end contiguity was resolved in 75% of the alignments, confirming the reliability of the transcriptome assembly, even though different taxa and libraries were used.
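The removal of reads with more than one alignment can be sketched as follows. This is a simplified illustration on invented read and contig identifiers, not the aligner or filter used in this study; the function name `keep_unique` and the input format (one `(read_id, contig_id)` pair per reported hit) are assumptions.

```python
from collections import Counter


def keep_unique(alignments):
    """Keep only hits from reads that align to exactly one location.

    alignments -- list of (read_id, contig_id) pairs, one per reported hit
    """
    hits_per_read = Counter(read_id for read_id, _ in alignments)
    return [(r, c) for r, c in alignments if hits_per_read[r] == 1]


# Invented hits: read r2 maps to two contigs and is therefore discarded.
toy_hits = [("r1", "c10"), ("r2", "c10"), ("r2", "c11"), ("r3", "c42")]
unique_hits = keep_unique(toy_hits)
print(unique_hits)
```

Discarding multiply aligned reads sacrifices coverage in paralogous or alternatively spliced regions in exchange for fewer false variant calls.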
Reliable SNPs (Bayesian probability >95%) were detected at 195 400 sites across the set of eleven accessions (Data S9). The average SNP frequency was calculated at 1/167 bp, with a mean of five per contig. Each SNP site was interrogated by scoring for the presence of at least one accession-specific sequence. Sequence information was available from an average of nine accessions per SNP site, and a core subset of 57 125 SNPs showed coverage from all the samples. Sanger sequencing was performed on 153 randomly chosen heterozygous SNP loci from the 454-derived ‘Romanesco C3’ set, of which 138 (90%) were confirmed (data not shown).
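The summary figures in this paragraph can be reproduced from the counts reported in the text, taking the 32.7 Mbp span and 38 726 contigs of the reference assembly as the denominators (variable names are ours):

```python
assembly_bp = 32.7e6   # span of the reference assembly
contigs = 38726        # reference contigs
snps = 195400          # SNPs with Bayesian probability > 95%

bp_per_snp = assembly_bp / snps     # average spacing between SNPs
snps_per_contig = snps / contigs    # mean SNPs per contig
validation_rate = 138 / 153         # Sanger-confirmed / tested loci

print(round(bp_per_snp), round(snps_per_contig), f"{validation_rate:.0%}")
```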
The merging of the Illumina-derived reads (eight accessions) with 454-generated reads substantially increased the number of parent-specific SNPs that were identified (Figure 5). A SNP-calling simulation was carried out using only the 454 reads or using the merged dataset. From the former set, approximately 46 600 SNPs were discovered, while approximately 81 700 SNPs were found in the latter set, indicating that the high-coverage Illumina reads helped to confirm a larger number of low-coverage SNP calls in the 454 dataset. In the 454-sequenced samples, the identification of exclusively homozygous mutations was increased by 74% in the merged data set compared with the set of 454-derived reads alone (23 393 vs. 13 454), and the identification of mixed homozygous/heterozygous mutation sites showed a 15% improvement (6682 vs. 5819). The best results were obtained for SNPs with exclusive allelic variants in heterozygous states, with an 89% increase in SNP discovery in the merged data set (51 579 vs. 27 360).
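The percentage gains and the two approximate totals quoted above follow from the three per-class counts, as this short check shows (the dictionary keys are our own labels):

```python
# (454 reads only, merged 454 + Illumina) SNP counts per class, from the text.
counts = {
    "homozygous only": (13454, 23393),
    "mixed homozygous/heterozygous": (5819, 6682),
    "heterozygous only": (27360, 51579),
}


def pct_increase(before, after):
    """Percentage gain of `after` relative to `before`."""
    return 100 * (after - before) / before


for label, (only_454, merged) in counts.items():
    print(f"{label}: +{pct_increase(only_454, merged):.0f}%")

total_454 = sum(b for b, _ in counts.values())
total_merged = sum(a for _, a in counts.values())
print(total_454, total_merged)
```

The class totals (46 633 and 81 654) match the rounded figures of approximately 46 600 and 81 700 SNPs given in the paragraph.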
The distribution of minor allele frequencies is shown in Figure 6: the distribution was even, with a slightly higher proportion of low-frequency variants, which can be ascribed to the presence of wild accessions in the sample panel. Cynara cardunculus is a highly heterozygous species, so mapping to date has been based on a pseudo-testcross strategy. The presence of intra-accession allelic variation is therefore of particular interest. As expected from their shallower coverage, the 454-derived sequences produced a somewhat lower frequency of successful heterozygous SNP calls (Figure 7). ‘Altilis 41’ was the least heterozygous of the accessions (17 570 loci), as has been observed previously (Portis et al., 2005a, 2012), whereas ‘Romanesco Zorzi’ was the most heterozygous (43 387), followed by ‘Violetto di Chioggia’ (41 824). ‘Imperial Star’ had the lowest ratio of heterozygous variants among globe artichoke genotypes (13.5%). Overall, SNPs were most frequent in the 3′-UTR (one per 126 bp), followed by the CDS (one per 169 bp) and the 5′-UTR (one per 265 bp); this same distribution of SNPs was also observed in Glycine soja (Kim et al., 2010). Of the SNPs located in the translated region, the balance between synonymous and non-synonymous substitutions was almost 1 : 1 (64 328 vs. 60 903). Premature stop codons were associated with 1949 SNP sites, and the relative frequency of transitions to transversions was 63% to 37%. The allelic diversity within each accession was evaluated by considering ‘private’ alleles (Table 8). The wild cardoon accessions carried the greatest number of allelic variants, of which 61% were synonymous and 39% non-synonymous. The cultivated cardoon accessions, with fewer private SNPs, showed a similar trend except for ‘Altilis 41’, which showed an increased level of non-synonymous private SNPs, similar to those of ‘Romanesco C3’ and ‘Romanesco Zorzi’.
With respect to private alleles in the homozygous state, wild and cultivated cardoon accessions shared a similar frequency, but globe artichoke accessions did not: ‘Romanesco C3’, ‘Violetto di Chioggia’ and ‘Romanesco Zorzi’ each harboured a small proportion of homozygous private alleles (respectively 2.2%, 5.9% and 7.1%), whereas ‘Imperial Star’ harboured 43.3% (Table 8).
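The notion of a ‘private’ allele used here, a variant observed in one accession and in no other, can be expressed with simple set operations. The sketch below is an illustration only: the function name `private_alleles`, the accession labels and the genotype data are invented, and the actual analysis additionally distinguished coding from non-coding sites and homozygous from heterozygous states.

```python
def private_alleles(genotypes):
    """Return, for each accession, the alleles seen in that accession only.

    genotypes -- dict mapping accession name to a set of (site, allele) pairs
    """
    private = {}
    for accession, alleles in genotypes.items():
        # Pool the alleles observed in every other accession.
        others = set().union(
            *(a for name, a in genotypes.items() if name != accession)
        )
        private[accession] = alleles - others
    return private


# Invented genotypes for three hypothetical accessions.
toy_genotypes = {
    "acc_A": {("snp1", "A"), ("snp1", "G"), ("snp2", "C")},
    "acc_B": {("snp1", "A"), ("snp2", "C"), ("snp2", "T")},
    "acc_C": {("snp1", "A"), ("snp2", "C")},
}
result = private_alleles(toy_genotypes)
print(result)
```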
Table 8. Categorization of private SNPs. Private SNPs per taxon are allelic variants shared within one taxon and absent from all others. Homozygous fixed SNPs include both coding and non-coding regions
In CDS (%)
Non synonymous (%)
In UTR (%)
Homozygous fixed (%)
Private SNPs (per accession)
‘Violetto di Chioggia’
‘Gobbo di Nizza’
‘Blanco de Peralta’
Private SNPs (per varietas)
NGS technology and the development of genomic resources in non-model plant species are significant from both agricultural and ecological points of view (Der et al., 2011; Kaur et al., 2011; Novaes et al., 2008; Trick et al., 2009). Much of the effort to date has focused on EST sequencing in an attempt to identify functionally important genes, whereas large-scale SNP characterization in plants has concentrated mostly on major crops. Globe artichoke plays a key role in the agricultural economy of the Mediterranean basin and its cultivation area, as well as its global economic significance, is increasing through progressive market expansion towards China, Africa and South America (FAOSTAT, 2009). However, unlike other species belonging to the Asteraceae family (like sunflower and lettuce), which have been the targets of extensive genomic investigations, globe artichoke can be considered an orphan crop species and its genome/transcriptome is relatively unexplored. The rapidly falling cost of DNA sequencing brought about by the development of NGS technology has now allowed us to fill this gap and, in this instance, the generation of new comprehensive genomic/transcriptomic resources was expedited. The 454 Titanium (454 Life Science, Branford, CT) platform was applied to the genotypes of each of the three C. cardunculus taxa (globe artichoke, cultivated and wild cardoon) to produce a reference transcriptome, whereas the use of the Illumina GAIIx platform allowed us to identify SNP variation by re-sequencing five globe artichoke, two cultivated cardoon and one wild cardoon accessions (Table 1), demonstrating the complementary utility of the two platforms.
De novo assembly
Ideally, sequence assembly and mapping require both a pre-existing scaffold of overlapping reads and that each locus can be uniquely identified. In EST data, the latter requirement is compromised by splicing variants and paralogy. As a result, some fine-tuning of the assembly/mapping procedures becomes necessary to avoid misplacement of sequences and the false calling of SNPs. Herein, a two-step approach was adopted, the first step aiming to remove chimeric reads and spoilers (i.e. mis-assembled reads that interrupt the progression of the overlap-layout-consensus algorithm) so that data loss in the second step (merging of the various datasets) could be minimized. Moreover, we did not apply a de Bruijn graph-based approach, as the high frequency of both heterozygous and homozygous allelic variants in this outbreeding and highly heterozygous species would give rise to unresolved bubbles, leading to truncated transcript assemblies. Thus, taking into account the diversity of the datasets and the possibility of chimeric reads in different phases of the data elaboration, we opted for an extensively parameterized assembler. The identification of a set of unigenes and the simulated assembly curves for each of the datasets suggested an even representation of transcripts across the three subspecies. As a result, it was possible to estimate that some 23% of the transcripts were splicing variants, which closely matches the estimated frequency of 21.5% in A. thaliana (TAIR10, http://www.arabidopsis.org). Further support for this estimate was provided by the observation that around 25% of the Illumina-derived reads aligned with multiple reference transcripts (data not shown). The normalization procedure adopted was effective in identifying the genes encoding a number of enzymes involved in phenylpropanoid synthesis. Only approximately 20 000 read ends (0.02% of the total sequence acquired) mapped to the RuBisCO large subunit transcript sequence.
The 454-based assembly succeeded in identifying non-nuclear transcripts by exploiting the expected high level of sequence conservation reported between the chloroplast genomes of the Asteraceae species lettuce and sunflower (Timme et al., 2007). In contrast, as expected, a reduced number of transcripts were identified using the sequence information from the grapevine mitochondrial genome, as its sequence complement is rather variable, even between closely related taxa, as a result of gene loss over time (Adams and Palmer, 2003; Palmer et al., 2000). With respect to the nuclear component, a number of genes within the phenylpropanoid synthesis pathway were identified.
The categorization of transcription factors is particularly important for gaining an understanding of the control of gene expression. Their number in the C. cardunculus genome appears to be similar to that in the grapevine genome (PlnTFDB, http://plntfdb.bio.uni-potsdam.de), which lends further support to the capture of most of the transcriptome within the present sequence dataset. Many of the genes associated with pathogen resistance (resistance gene analogues, RGAs) are of the NBS-LRR type, which form a large gene family in plant genomes. They are well represented in the C. cardunculus transcriptome, allowing for the possibility of undertaking a candidate gene approach in a positional cloning strategy for genes determining pathogen resistance. Finally, stress-responsive transcription factors (e.g. MYB, NAC and WRKY), secondary metabolite (CQA)-related genes and RGA gene families will be selected for SNP marker development to saturate the C. cardunculus maps we previously developed (Portis et al., 2009, 2012), facilitate QTL analyses and foster the implementation of breeding strategies.
Transposable elements and miRNA targets
Certain LTR retrotransposons appear to target the gene space (Hirochika et al., 1996) and their contribution to gene evolution and expression has been widely explored (Bennetzen, 2000). In maize, the phenotypic consequences of TE activity have been widely investigated (Marillonnet and Wessler, 1997; Scott et al., 1996; Wessler et al., 1995) and several genes have been shown to harbour relict TE sequence. About 100 instances of copia-like elements were noted in the C. cardunculus transcriptome; their mean length of 151 bp was almost the same as that of the DNA/Helitron and RC/Helitron sequences present (142 and 156 bp, respectively), whereas the DNA/En-Spm and LTR/Gypsy-like sequences were somewhat shorter (95 and 108 bp, respectively). The mean insert size of copia-like elements present in the 5′-UTR fraction (252 bp) was rather longer than their mean length in the CDS (103 bp) or 3′-UTR (98 bp), but this pattern did not extend to either the DNA/En-Spm or the Gypsy elements. Overall, this suggests that Gypsy-like, copia-like and En/Spm-like TEs have had a measurable influence on the evolution of the host’s gene space. The presence of such long copia-like element relicts in the 5′-UTR regions may imply that the constraints imposed on transposition are TE-dependent. Some useful information emerged from the analysis of conserved miRNA binding sites. Although requiring experimental validation, the sequence data suggested an important contribution of miR2109, which targets several TIR-type NBS-LRR R genes (Wang et al., 2011).
SNP frequency and diversity
SNP frequencies in the C. cardunculus transcriptome appear to be comparable to those found in the heterozygous grapevine whole genome sequence (Velasco et al., 2007) and among Citrus species ESTs (Jiang et al., 2010). Within the UTRs, the frequency also matched that obtained in tomato expressed sequences (Jimenez-Gomez and Maloof, 2009), while it was markedly higher than that present in the coding region (approximately 2/kb). This discrepancy may reflect either the greater tolerance of non-synonymous substitutions in the heterozygous state, or merely an ascertainment bias arising from the analysis of a larger germplasm panel (which also included an accession of a wild relative).
The merging of the 454- with the Illumina-derived data was a critical step towards SNP calling, in particular for rare heterozygous SNPs, which frequently could not be identified from the 454 data alone because of low read counts. The heterozygous variants are needed to develop test-cross markers (Figure 5). The 90% validation rate of SNPs called from the 454-derived sequence confirmed the reliability of the platform (Pavy et al., 2006). Despite the lower coverage of the 454 reads, the relative abundance of homozygous and heterozygous SNPs was maintained in the merged dataset. The globe artichoke cultivar ‘Imperial Star’ had a high number of private alleles in the homozygous state, which likely reflects its development by directed breeding, possibly involving crosses with exotic material and the use of enforced self-fertilization. In contrast, farmer selection in clonally propagated varietal types has maintained a wide range of genetic variation and heterozygosity within these types (Portis et al., 2005a, 2012). The relatively high occurrence of private alleles among the wild cardoon accessions may reflect their high genetic diversity (Portis et al., 2005b), although the number of accessions analysed was just two. However, private allele richness remained consistent across 454- and Illumina-sequenced samples, suggesting a very low probability of false discovery events.
The combination of two NGS platforms has allowed for both an extensive characterization of the C. cardunculus transcriptome, and the rapid discovery of a large number of SNPs. The de novo assembly of the transcripts produced a catalogue of transcription factors, miRNA targets and putative R genes, which together represent an invaluable resource for upcoming genomic and genetic studies in this species. As many of the identified SNPs reside in homologs already mapped in lettuce and sunflower, they will facilitate comparative genetics across the Asteraceae family.
The adoption of stringent alignment and calling criteria has generated a robust EST-SNP database, the size of which is more than sufficient for any conceivable genotyping application. The identification of common and rare allelic variants will facilitate the design of diversity or association studies, whereas the inclusion of mapping parents in the germplasm panel has provided a ready-made source of informative assays for genetic mapping purposes. The availability of such a large number of sequence-based markers, in a format allowing for high throughput genotyping, offers many opportunities to conduct genetic analyses of key agricultural traits in a Mediterranean crop species which has end-uses as a food, as a source of nutraceuticals and as a possible biomass producer. In particular, to guide future breeding activities, it will be of interest to focus on specific classes of genes, such as stress-responsive transcription factors and RGAs. The outcomes of such analyses should facilitate the implementation of marker-assisted selection for the improvement of globe artichoke and cultivated cardoon.
Plant materials and RNA extraction
RNA was extracted from the eleven C. cardunculus accessions listed in Table 1. Leaf and root material was harvested from field-grown plants of ‘Romanesco C3’, ‘Altilis 41’ and ‘Creta 4’, rinsed in sterile water, frozen in liquid nitrogen and stored at −80 °C. Plants of the other eight accessions were grown from seed in pots of sand soaked in a hydroponic solution. After 5 weeks, the leaves and roots were collected, frozen in liquid nitrogen and stored at −80 °C. Total RNA was extracted from 1 g of each tissue using the TRIzol reagent (Invitrogen, Carlsbad, CA). The resulting RNA was quantified and checked for purity using a Nanodrop 2000c spectrophotometer (Thermo Scientific, Wilmington, DE). The RNA from the root and leaf of each accession was mixed in equimolar amounts and its integrity checked using a Bioanalyzer 2100 (Agilent Technologies, Santa Clara, CA).
Construction of normalized cDNA libraries and sequencing
To obtain full length cDNAs for transcriptome analysis, we used SMART cDNA synthesis technology, following the protocol of the Clontech Creator SMART cDNA library construction kit (Clontech Laboratories, Mountain View, CA). This requires an oligo-dT primer that anneals to the poly(A) tail of mRNA to prime the cDNA synthesis process. However, for 454 pyrosequencing, mononucleotide runs reduce sequence quality and quantity because of excessive light production and crosstalk between neighboring cells (Margulies et al., 2005). To counteract this problem, we employed a ‘broken chain’ short oligo-dT primer (primer sequence: 5′-AAGCAGTGGTATCAACGCAGAGTCGCAGTCGGTACTTTTTTCTTTTTTV-3′, V = A, G or C) to prime the poly(A) tail of mRNA during first strand cDNA synthesis (Meyer et al., 2009). Approximately 1.5 μg of total RNA was reverse-transcribed to first-strand cDNA using this method. Following first strand cDNA synthesis, double-stranded (ds) cDNA synthesis was performed using Phusion polymerase (New England Biolabs, Ipswich, MA) with a hot start of 98 °C for 30 s, followed by 18 cycles of 98 °C for 7 s, 66 °C for 20 s and 72 °C for 4 min. The ds-cDNA PCR product was purified using a QIAquick PCR Purification column (Qiagen, Waltham, MA). To reduce the abundance of common transcripts, the cDNA library was normalized using a TRIMMER-DIRECT cDNA normalization kit (Evrogen, Moscow, Russia). Approximately 800 ng of purified ds-cDNA was used as the starting material for normalization. A mixture of the 0.25 and 0.5 μL DSN normalization tubes was used for the first and second amplifications. After normalization, the cDNA was fragmented to 500–800 bp by sonication and size-selected to remove small fragments using AMpure SPRI beads. The fragment ends were then polished and ligated with adaptors.
The optimal ligation products were selectively amplified and subjected to two rounds of size selection including gel electrophoresis and AMpure SPRI bead purification (Lai et al., 2012). Each of the three libraries was sequenced using a half picotitre plate in a 454 FLX Titanium device, following the manufacturer’s sequencing protocol. For Illumina sequencing of the eight remaining samples, we used the SMART technology to generate full length cDNA, which was normalized using the TRIMMER cDNA normalization kit as described above. The ds-cDNAs were then processed with the standard multiplexing genomic DNA shotgun library preparation kit to generate the libraries for sequencing. Sequencing was carried out in two Illumina GAIIx lanes (four multiplexed samples each), following the manufacturer’s protocol.
The 454-derived reads were first screened to remove adaptor/primer leftovers using the SeqClean perl script (compbio.dfci.harvard.edu/tgi/software/). Sequences were quality clipped using a sliding window analysis starting from the last 6 bp on both the 5′- and the 3′-end; when the average quality fell below Phred20, the two external bp were discarded and the window shifted accordingly. Illumina reads were de-multiplexed using the Illumina Pipeline software, and the sequences were then screened for residual adaptor sequence using the ShortRead package (http://www.bioconductor.org). The subsequent quality clipping was performed as described for the 454-derived sequence.
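The quality-clipping rule described above can be illustrated with a minimal Python sketch. It is not the script used in this study: the function name `clip_read`, the use of a plain arithmetic mean, and the decision to return an empty read once fewer than six bases remain are all assumptions made for illustration.

```python
from statistics import mean


def clip_read(seq, quals, window=6, min_q=20, step=2):
    """Trim both ends of a read with a sliding-window quality rule.

    seq   : nucleotide string
    quals : list of Phred scores, one per base
    A 6 bp window is placed at each end; while its mean quality is
    below min_q, the two outermost bases are dropped and the window
    shifts inwards. (Hypothetical helper, not the authors' code.)
    """
    start, end = 0, len(seq)
    # 5' end: window covers the first `window` bases still retained
    while end - start >= window and mean(quals[start:start + window]) < min_q:
        start += step
    # 3' end: window covers the last `window` bases still retained
    while end - start >= window and mean(quals[end - window:end]) < min_q:
        end -= step
    # Assumption: reads shorter than one window after clipping are discarded
    if end - start < window:
        return "", []
    return seq[start:end], quals[start:end]


if __name__ == "__main__":
    read = "ACGTACGTACGT"
    scores = [2, 2, 2, 2] + [40] * 8
    print(clip_read(read, scores))
```

In this toy read the first window has a mean quality below 20, so two bases are removed from the 5′ end; the shifted window then passes and clipping stops.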
De novo assembly of 454 reads
The assembly of the 454-derived reads was achieved using a two-step procedure implemented in the MIRA assembler v3.2.0 (Chevreux et al., 2004). An overview of the analysis pipeline is provided in Figure 1. The reads from each of the three libraries were initially assembled independently. For the ‘Romanesco C3’ library, a hybrid assembly was carried out by including 36 321 Sanger-ESTs (http://www.ncbi.nlm.nih.gov/dbEST) developed by the previous efforts of the Compositae Genome Project (CGP, http://compgenomics.ucdavis.edu). Reads which remained as singletons were discarded. Each contig was prefixed by ‘Scolymus_’, ‘Altilis_’ or ‘Sylvestris_’. In the second step, the contigs were treated as Sanger reads and were merged. The different parameters applied in each of the two assembly phases are listed in Table 9. Both the post-merging contigs and those remaining as singletons were included in the transcript dataset. Contigs obtained by merging across subspecies were prefixed ‘C_cardunculus_joined_’. Potential contaminant-derived contigs were identified by Blastx against the Viridiplantae protein database (NCBI) and removed. A representative catalogue of unigenes was identified by additional clustering based on the CAP3 algorithm (Huang and Madan, 1999), applying a 50% criterion for maximum overhang length and a 95% identity cut-off. The method was validated through an independent cluster analysis based on the CD-HIT-EST software (Huang et al., 2010; data not shown), applying a 95% identity cut-off and the -r and -g options, with all other parameters left as default. The assembly performance was assessed by bootstrapped EST sampling (Der et al., 2011), applying 1000 permutations.
Table 9. Parameter configuration of the two assembly phases. Default MIRA v. 3.2.0 values were used for any parameter not listed
First assembly phase
Second assembly phase
*SANGER_SETTINGS in the first assembly phase were applied only to the hybrid assembly of ‘Romanesco C3’ 454 reads with Sanger-ESTs from the CGP database.
Location of ORFs and introns
The HMM-based ESTScan3 software (Iseli et al., 1999) was adopted for ORF identification (score threshold set with b = 0.8), using a customized HMM matrix created by means of scripts available in the prot4EST package (Wasmuth and Blaxter, 2004). For this purpose, a simulated transcriptome was obtained by parsing the Blastx output against the TAIR9 dataset (http://www.arabidopsis.org/), setting as thresholds an E-value of e−15 and no more than six alignment gaps. Where ESTScan failed to generate a result, Blastx output parsing was used instead. Putative introns were inferred from an additional Blastx analysis against the TAIR9 database, by parsing alignment offsets with genomic coordinates of exons. Intron positions were reported on the C. cardunculus contigs, using a perl script available at int-citrusgenomics.org/usa/ucr/files/intron_finder.zip. The default E-value cut-off of e−57 was considered to imply orthology.
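The Blastx parsing step used to build the simulated transcriptome can be sketched as a simple filter. The example below assumes that the Blastx output is in the standard 12-column tabular format and treats the gap-openings column as a stand-in for ‘number of alignment gaps’; both are assumptions, as are the function name `filter_blast_hits` and the identifiers in the toy input.

```python
def filter_blast_hits(lines, max_evalue=1e-15, max_gaps=6):
    """Keep tabular BLAST hits passing E-value and gap thresholds.

    Assumed columns (standard 12-column tabular output):
    query, subject, %identity, length, mismatches, gap openings,
    q.start, q.end, s.start, s.end, E-value, bit score.
    """
    kept = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        evalue = float(fields[10])
        gaps = int(fields[5])
        if evalue <= max_evalue and gaps <= max_gaps:
            kept.append((fields[0], fields[1], evalue, gaps))
    return kept


if __name__ == "__main__":
    # Hypothetical hits, for illustration only
    demo = [
        "contig_1\tAT1G01010.1\t85.0\t200\t20\t2\t1\t600\t1\t200\t1e-40\t300",
        "contig_2\tAT2G01020.1\t60.0\t150\t50\t8\t1\t450\t1\t150\t1e-30\t120",
        "contig_3\tAT3G01030.1\t40.0\t90\t50\t1\t1\t270\t1\t90\t1e-5\t50",
    ]
    for hit in filter_blast_hits(demo):
        print(hit)
```

Only the first toy hit is retained: the second exceeds the gap limit and the third fails the E-value threshold.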
Assembly performance was predicted using the simulation tool ESTcalc (fgp.huck.psu.edu/NG_Sims/ngsim.pl), considering the sequencing output obtained after the removal of adaptors. The permutation-based estimation of EST clustering was accomplished using a perl script developed by Der et al. (2011), combining MIRA-assembly output with CAP3 clustering (1000 permutations).
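The idea behind a permutation-based estimate of clustering can be pictured as a rarefaction curve: reads are repeatedly subsampled at a given depth and the number of distinct clusters recovered is averaged over the permutations. The sketch below is not the perl script of Der et al. (2011); the function name `rarefaction`, the input layout (one cluster label per read) and the choice of sampling without replacement are assumptions made for illustration.

```python
import random


def rarefaction(read_clusters, depths, n_perm=1000, seed=0):
    """Mean number of distinct clusters seen at each sampling depth.

    read_clusters : list holding, for every read, the label of the
                    cluster (unigene) it was assigned to
    depths        : iterable of subsample sizes (each <= number of reads)
    n_perm        : number of random subsamples per depth
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    curve = {}
    for depth in depths:
        total = 0
        for _ in range(n_perm):
            # Assumption: subsample reads without replacement
            sample = rng.sample(read_clusters, depth)
            total += len(set(sample))
        curve[depth] = total / n_perm
    return curve


if __name__ == "__main__":
    # Toy dataset: 3 clusters of very different abundance
    reads = ["c1"] * 80 + ["c2"] * 15 + ["c3"] * 5
    print(rarefaction(reads, depths=[5, 20, 50, 100], n_perm=200))
```

A curve that flattens as depth increases suggests that additional sequencing would add few new clusters.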
Non-nuclear transcripts were identified via Blastn analysis (E-value <e−35, nucleotide identity >85%). The identification of chloroplast sequences within the C. cardunculus sequence dataset was based on the sunflower chloroplast sequence, and that of mitochondrial sequences on the grapevine sequence. The contigs were submitted to an annotation pipeline using the Blast2GO java interface (Conesa et al., 2005). First, all sequences were searched against the NCBI nr protein database using the Blastx algorithm (E-value <e−3, 20 best-hits recorded). Gene names and GI (gene identifier) numbers were recorded, and PIR (Non-redundant Reference Protein Database) codes were retrieved by reference to UniProt, SwissProt, TrEMBL, RefSeq, GenPept or PDB. The GIs were used to search the GO database. The annotation of the transcripts was performed by applying the Blast2GO-embedded formula, setting a threshold score of 55. ANNEX augmentation was performed to retrieve GO terms implied by certain GO ‘Molecular Function’ terms. An InterPro scan was carried out to collect additional GO annotations on the basis of conserved functional domains. Enzyme codes were retrieved from GO tables and mapped onto KEGG pathways.
The identification and categorization of transcription factors were based on the GO terms ‘regulation of transcription’, ‘transcription factor activity’, ‘transcription activator activity’, ‘regulation of transcription, DNA-dependent’, ‘transcription repressor activity’, ‘negative regulation of transcription’, ‘positive regulation of transcription’ or ‘nucleic acid binding’; the selected sequences were then used for a Blastx search against the Plant Transcription Factor Database (Perez-Rodriguez et al., 2010).
R gene candidates were identified by means of Blastp analysis against the Plant Resistance Genes database (PRGdb) (Sanseverino et al., 2010). Positive hits were validated with the HMMER v3 software (hmmer.janelia.org/software), searching against PFAM hidden Markov models for NB-ARC (PF00931), TIR (PF01582) and several leucine-rich repeat (PF00560, PF01462, PF01463, PF12799, PF07723, PF08191, PF08263, PF07725, PF01816, PF12534) motifs (Finn et al., 2010); the ‘gathering scores’ method was applied to retrieve positive hits.
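Once the domain hits passing the gathering thresholds are available, they can be summarized per candidate protein. The following sketch groups the Pfam accessions named above into NB-ARC, TIR and LRR classes; it assumes the hits have already been parsed into (protein, accession) pairs, and the function name `domain_profile` and the protein labels are hypothetical.

```python
# Pfam accessions as listed in the text (TIR is Pfam PF01582)
NB_ARC = {"PF00931"}
TIR = {"PF01582"}
LRR = {"PF00560", "PF01462", "PF01463", "PF12799", "PF07723",
       "PF08191", "PF08263", "PF07725", "PF01816", "PF12534"}


def domain_profile(hits):
    """Map each protein to the set of R-gene domain classes it carries.

    hits : iterable of (protein_id, pfam_accession) pairs; accessions
           may carry a version suffix such as 'PF00931.17'.
    Proteins without any of the listed domains are omitted.
    """
    profiles = {}
    for protein, accession in hits:
        accession = accession.split(".")[0]  # drop version suffix
        if accession in NB_ARC:
            label = "NB-ARC"
        elif accession in TIR:
            label = "TIR"
        elif accession in LRR:
            label = "LRR"
        else:
            continue  # domain not in the surveyed set
        profiles.setdefault(protein, set()).add(label)
    return profiles


if __name__ == "__main__":
    # Hypothetical hits, for illustration only
    demo = [("prot_1", "PF00931.17"), ("prot_1", "PF01582.15"),
            ("prot_1", "PF00560.28"), ("prot_2", "PF00069.20")]
    print(domain_profile(demo))
```

Such a profile makes it easy to separate, for example, candidates carrying both NB-ARC and LRR domains from those supported by a single domain only.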
The presence of TE relict sequence was inferred by a RepeatMasker scan, using a local rmBlastn analysis against the Viridiplantae RepBase_v16.01. Only alignments with an identity >75%, a minimum length of 60 bp and a rmBlastn score >225 were retained. Transcribed sequences were subjected to psRNATarget analysis against miRBase-16 (Griffiths-Jones et al., 2008). A maximum expectation of 2.5 was adopted, allowing a maximum energy to unpair the target site of 25, and considering 17 bp upstream and 13 bp downstream of the target site. Inhibition of translation was inferred where mismatches occurred at the 9th to 11th nucleotides of the mature miRNA. Any enrichment of GO terms was identified by comparing the putative miRNA targets against the whole transcript dataset by means of the Gossip package implemented in the Blast2GO suite: a Fisher’s exact test was applied, collecting terms with P values <e−4 and a false discovery rate <0.01.
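The enrichment test can be illustrated for a single GO term with a short calculation. The sketch below computes a one-sided (over-representation) Fisher’s exact P value from the hypergeometric distribution; it is not the Gossip implementation, it assumes that the miRNA targets are a subset of the whole transcript dataset, and it omits the false discovery rate correction. The function and argument names are hypothetical.

```python
from math import comb  # available from Python 3.8


def fisher_enrichment_p(hits_in_targets, n_targets, hits_in_all, n_all):
    """One-sided Fisher's exact P value for over-representation.

    hits_in_targets : miRNA targets annotated with the GO term
    n_targets       : total number of miRNA targets
    hits_in_all     : transcripts in the whole dataset with the term
    n_all           : total number of transcripts in the dataset
    Returns P(X >= hits_in_targets) under the hypergeometric model.
    """
    upper = min(n_targets, hits_in_all)
    numerator = sum(
        comb(hits_in_all, k) * comb(n_all - hits_in_all, n_targets - k)
        for k in range(hits_in_targets, upper + 1)
    )
    return numerator / comb(n_all, n_targets)


if __name__ == "__main__":
    # Toy numbers: 4 of 5 targets carry a term found in 4 of 10 transcripts
    p = fisher_enrichment_p(4, 5, 4, 10)
    print(p, p < 1e-4)
```

In practice the test would be repeated for every GO term, and the resulting P values adjusted for multiple testing before applying the thresholds given above.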
Read mapping and SNP calling
Reads produced by both sequencing platforms were aligned against the reference contig set. The alignment software package Mosaik (bioinformatics.bc.edu/marthlab/Mosaik) was used to process the 454- and Illumina-derived reads separately. The parameters for the 454 reads were: ‘m=unique, hs=13, act=36, bw=41, mmp=.07, mmal, minp=.90’, while Illumina paired-end reads were aligned with ‘hs=11, act=25, bw=25, m=all, mmp=.07, mmal, minp=.90, ls=150’ and paired-ends were sorted with ‘afl, ci=.9985, sa, rmm’. Alignments were merged and assembled with the MosaikMerge and MosaikAssembler modules, respectively. Automated SNP calling on aligned reads was carried out with the gigaBayes software (bioinformatics.bc.edu/marthlab/GigaBayes). SNPs were identified on the basis of the following search parameter set: ‘ploidy = diploid, QRL = 30, CAL = 3, PSL = 0.95’. A customized perl script (provided by JM Jimenez-Gomez, Max Planck Institute for Plant Breeding Research, Cologne, Germany) was used to predict ORFs, to assign each SNP to the CDS, the 5′-UTR or the 3′-UTR, and to assess whether each coding SNP was synonymous. The full SNP data set has been organized into a relational database, which is available upon request.
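The logic of assigning a SNP to a transcript region and testing whether it is synonymous can be sketched as follows. This is not the perl script used in the study: the function name `classify_snp`, the 0-based half-open CDS coordinates and the toy transcript are assumptions, and the sketch expects a complete, unambiguous codon (no N bases) at the SNP position.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the conventional TCAG ordering
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_snp(transcript, cds_start, cds_end, pos, alt):
    """Locate a SNP and, if coding, test whether it is synonymous.

    transcript         : transcript sequence (DNA alphabet)
    cds_start, cds_end : 0-based CDS coordinates, end exclusive
    pos                : 0-based SNP position on the transcript
    alt                : alternative allele (single base)
    """
    if pos < cds_start:
        return "5'-UTR"
    if pos >= cds_end:
        return "3'-UTR"
    offset = pos - cds_start
    frame = offset % 3                      # position within the codon
    codon_start = pos - frame
    ref_codon = transcript[codon_start:codon_start + 3].upper()
    alt_codon = ref_codon[:frame] + alt.upper() + ref_codon[frame + 1:]
    if CODON_TABLE[ref_codon] == CODON_TABLE[alt_codon]:
        return "CDS, synonymous"
    return "CDS, non-synonymous"


if __name__ == "__main__":
    # Toy transcript: 2 bp 5'-UTR, ATG GCT TAA, 2 bp 3'-UTR
    seq = "GGATGGCTTAACC"
    print(classify_snp(seq, 2, 11, 7, "C"))  # GCT -> GCC
    print(classify_snp(seq, 2, 11, 5, "A"))  # GCT -> ACT
```

In the toy transcript, the change at the third codon position leaves alanine unchanged, whereas the change at the first codon position replaces alanine with threonine.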
This research was supported by (i) U.S. National Science Foundation grant DBI-0820451 to LHR, SJK, and ZL, and (ii) the Italian MIPAAF through the CYNERGIA and CARVARVI research projects. We thank Joan Wong for help with the manuscript revision.