• Open Access

The genome of flax (Linum usitatissimum) assembled de novo from short shotgun sequence reads

Authors


(e-mails deyholos@ualberta.ca, wangj@genomics.org.cn and gane@ualberta.ca).

Summary

Flax (Linum usitatissimum) is an ancient crop that is widely cultivated as a source of fiber, oil and medicinally relevant compounds. To accelerate crop improvement, we performed whole-genome shotgun sequencing of the nuclear genome of flax. Seven paired-end libraries ranging in size from 300 bp to 10 kb were sequenced using an Illumina genome analyzer. A de novo assembly, comprised exclusively of deep-coverage (approximately 94× raw, approximately 69× filtered) short-sequence reads (44–100 bp), produced a set of scaffolds with N50 = 694 kb, including contigs with N50 = 20.1 kb. The contig assembly contained 302 Mb of non-redundant sequence representing an estimated 81% genome coverage. Up to 96% of published flax ESTs aligned to the whole-genome shotgun scaffolds. However, comparisons with independently sequenced BACs and fosmids showed some mis-assembly of regions at the genome scale. A total of 43 384 protein-coding genes were predicted in the whole-genome shotgun assembly, and up to 93% of published flax ESTs, and 86% of A. thaliana genes aligned to these predicted genes, indicating excellent coverage and accuracy at the gene level. Analysis of the synonymous substitution rates (Ks) observed within duplicate gene pairs was consistent with a recent (5–9 MYA) whole-genome duplication in flax. Within the predicted proteome, we observed enrichment of many conserved domains (Pfam-A) that may contribute to the unique properties of this crop, including agglutinin proteins. Together these results show that de novo assembly, based solely on whole-genome shotgun short-sequence reads, is an efficient means of obtaining nearly complete genome sequence information for some plant species.

Introduction

Flax (Linum usitatissimum) has been used by humans for at least nine millennia and possibly for as long as 30 millennia (van Zeist and Bakker-Heeres, 1975; Zohary and Hopf, 2000; Kvavadze et al., 2009). Varieties of this species are now cultivated for either fibers or seed, on approximately 3 Mha in over 20 countries (http://faostat.fao.org/). The bast fibers that grow in the outer layers of the flax stem are made up of remarkably long cells that are rich in crystalline cellulose and have a very high tensile strength (Mohanty et al., 2000). These fibers are the source of linen textiles, and their use in composite materials is an area of active research (Bodros et al., 2007). The seed of flax (i.e. linseed) produces oil that is rich in unsaturated fatty acids, especially α-linolenic acid (C18:3), polymers of which are used in linoleum, paints and other finishes. Consumption of the oil or seed has been reported to have beneficial effects on cardiovascular health and in the treatment of certain cancers and inflammatory diseases (Singh et al., 2011). Health benefits are derived from both the α-linolenic acid and other components of the seed, including lignans such as secoisolariciresinol diglucoside (SDG), which is an antioxidant and the precursor of several phytoestrogens (Toure and Xu, 2010). Flax seed is also used in animal feed to increase levels of α-linolenic acid in meat or eggs (Simmons et al., 2011). In total, flax is a source of at least three classes of commercial products: fiber, seed oil, and nutraceuticals. Increased understanding of the genes affecting the quality and yield of these bioproducts will contribute to crop improvement for flax and other oil- or fiber-producing species.

In addition to its economic uses, flax has several characteristics that make it an interesting subject for scientific inquiry. Among these is the diversity of the flax family and its relatives. The genus Linum comprises an estimated 180 species distributed over six continents (McDill et al., 2009). The genus displays interesting variation in evolutionarily significant characteristics such as breeding system (distylous and homostylous species are known), flower color (ranging from white to yellow, red and blue), pollen morphology (tricolpate or multiporate) and edaphic endemism (i.e. restriction to specific soil types). The family Linaceae has impressive ecological and morphological diversity, with more than 270 species in 14 genera ranging from tiny, serpentine-endemic annuals in the California chaparral (the genus Hesperolinon) to canopy trees in South American black-water forests (Hebepetalum) and tendrillate vines in Indonesian jungles (Indorouchera). Linum and the Linaceae have recently been the subject of molecular phylogenetic investigations using chloroplast markers, the nuclear ribosomal internal transcribed spacer (ITS), and retrotransposon sequences (McDill et al., 2009; Fu and Allaby, 2010; McDill and Simpson, 2011; Smykal et al., 2011). The Linaceae family belongs to the order Malpighiales, of which three species have been fully sequenced: Populus trichocarpa (black cottonwood, Salicaceae), Ricinus communis (castor bean, Euphorbiaceae) and Manihot esculenta (cassava, Euphorbiaceae) (Tuskan et al., 2006; Chan et al., 2010; http://phytozome.net). The phylogenetic relationships among families in the Malpighiales are not entirely resolved (Wurdack and Davis, 2009), but recent estimates suggest that the Linaceae diverged as a lineage distinct from other families of Malpighiales approximately 100 MYA (Davis et al., 2005). The lineage leading to Arabidopsis thaliana, whose genome is commonly used in comparative genome analyses, is thought to have diverged from the lineage leading to Malpighiales between 100 and 120 MYA (Tuskan et al., 2006).

Flax is also of general scientific interest because some varieties of flax (e.g. Stormont Cirrus) exhibit rapid changes in nuclear DNA content (Cullis, 1981b, 2005). Differences in nuclear C-value of up to 15% have been reported among first-generation progeny of self-pollinated individuals that were subjected to specific temperature or fertilizer regimes (Evans et al., 1966). Altered copy numbers of many types of genetic elements contribute to the observed changes in genome size, including rRNA genes and an unusual insertional element named LIS-1 (Goldsbrough et al., 1981; Chen, 1999). Increased genomic sequence information for flax may help to further elucidate the basis of genome plasticity.

Patterns of genetic inheritance in L. usitatissimum are typical of a diploid. L. usitatissimum has n = 15 chromosomes, which is common in certain lineages of the genus. Other species of Linum have been reported to have n = 8, 9, 10, 13, 14, 18 or 27 chromosomes (Rogers, 1982), consistent with a history of repeated genome duplications (polyploidy) and possible instances of aneuploid reductions or increases. Published estimates of the nuclear DNA content of flax range from C = 538 Mb to C = 685 Mb, although reassociation kinetics analyses have produced estimates as low as C = 350 Mb (Evans et al., 1972; Cullis, 1981a; Marie and Brown, 1993). The reassociation kinetics studies also show that approximately 50% of the genome is composed of low-copy-number sequences, with 35% highly repetitive sequences, and the remaining approximately 15% in the middle-repetitive fraction, which typically represents transposable elements (Cullis, 1980). Genomic sequence resources for L. usitatissimum in public databases include 286 294 ESTs and 80 339 BAC end sequences in GenBank (http://www.ncbi.nlm.nih.gov/nucleotide/). The majority of these sequences were obtained from the cultivar CDC Bethune (Venglat et al., 2011). A linkage map based on SSRs (Cloutier et al., 2009) and a 368 Mb BAC physical map of CDC Bethune (Ragupathy et al., 2011) have been published recently. Microarray gene expression and proteome studies of flax have also been reported (Lynch and Conery, 2000; Roach and Deyholos, 2007; Fenart et al., 2010).

To further increase the available sequence resources for flax, we have performed whole-genome sequencing of the CDC Bethune cultivar of Linum usitatissimum, which is a highly inbred, elite linseed cultivar grown on the majority of flax acreage in Canada (Rowland et al., 2002). CDC Bethune does not exhibit any of the phenotypic plasticity that accompanies the rapid genome change observed in Stormont Cirrus. We used a whole-genome shotgun approach to sequence CDC Bethune, using de novo assembly of exclusively short reads, a strategy that has previously proven effective in vertebrates (Li et al., 2010) and date palm (Phoenix dactylifera) (Al-Dous et al., 2011). The primary objective was to characterize the gene space of flax, in support of gene discovery, marker development and reverse genetics activities that are already underway. The analysis of the WGS assembly is presented here.

Results

To obtain a whole-genome shotgun (WGS) assembly of flax, we extracted DNA from axenically grown, etiolated seedlings of a linseed cultivar of L. usitatissimum, CDC Bethune, to produce size-selected sequencing libraries based on five insert sizes ranging from 300 bp to 10 kb in length (Table 1). Each library was used as a template in paired-end or mate-pair sequencing reactions in an Illumina GAII genome analyzer, producing a total of 34.98 Gb of reads that ranged between 44 and 100 bp in length. After filtering low-quality sequences, the remaining 25.88 Gb were used in assembly. Using flow cytometry, we estimated the nuclear DNA content of CDC Bethune to be 2N = 2C = 0.764 pg, corresponding to a haploid nuclear genome of 373 Mb. The sum of the WGS sequence reads obtained therefore represented approximately 94× coverage (raw reads) and 69× coverage (filtered reads) of the nuclear genome of flax.

Table 1. Summary statistics of the paired-end libraries used in WGS sequencing
Insert size | Read length (bp) | Total data (Gb) | Filtered data (Gb) | Sequence depth (×)
300 bp | 44/75/100 | 12.87 | 9.81 | 26.3
500 bp | 44/75/100 | 15.41 | 10.54 | 28.3
2 kb | 44 | 3.19 | 2.77 | 7.4
5 kb | 44 | 1.82 | 1.61 | 4.3
10 kb | 44 | 1.69 | 1.15 | 3.1
Total | - | 34.98 | 25.88 | 69.4
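
The genome size and coverage values above follow from simple arithmetic, sketched here in Python. The conversion factor of roughly 978 Mb per pg of DNA is an assumption (a commonly used value that is not stated in the text), and the function names are illustrative only.

```python
# Hypothetical helper names; input values are taken from the text and Table 1.
MB_PER_PG = 978.0  # assumed conversion factor: 1 pg of DNA is about 978 Mb


def haploid_genome_mb(two_c_pg: float) -> float:
    """Convert a 2C nuclear DNA content (pg) to a haploid genome size (Mb)."""
    return two_c_pg / 2.0 * MB_PER_PG


def coverage(total_gb: float, genome_mb: float) -> float:
    """Sequencing depth = total bases sequenced / haploid genome size."""
    return total_gb * 1000.0 / genome_mb


genome = haploid_genome_mb(0.764)        # about 373.6 Mb
raw_depth = coverage(34.98, 373.0)       # about 93.8x (raw reads)
filtered_depth = coverage(25.88, 373.0)  # about 69.4x (filtered reads)
print(round(genome, 1), round(raw_depth, 1), round(filtered_depth, 1))
```

The same `coverage` function applied to each row of filtered data reproduces the per-library depths in Table 1.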

SOAPdenovo is a de Bruijn graph-based assembly program designed to use short WGS reads as its input (Li et al., 2009). Applying this software to filtered Illumina reads, we generated a de novo assembly in which 50% of the assembly (N50) was contained in 4427 contigs of 20 125 bp or larger (Table 2). All contigs were further assembled into 88 384 scaffolds (≥100 bp), of which 132 scaffolds containing 50% of the assembly were 693.5 kb or larger (N50 = 693.5 kb). The contigs and scaffolds each contained a total of 302 and 318 Mb of unique sequence, respectively. Therefore, as a proportion of the nuclear DNA content estimated by flow cytometry, the WGS assembly contained an equivalent of 81% genome coverage in contigs and 85% genome coverage in scaffolds. The modal sequencing depth in the assembly was 45× (Figure S1). The distribution of assembly unit lengths is shown in Figure 1(a). The longest contig in the assembly was 151.8 kb and the longest scaffold was 3.09 Mb. Gaps within scaffolds ranged in length from 1 to 10 807 bp, with a median length of 148 bp (Figure 1b). The WGS contigs contained 40% G+C bases, which is higher than most other sequenced dicot genomes (Figure 1c). The same G+C proportion was observed in an analysis of 80 340 capillary-sequenced BAC end sequences from the same cultivar of flax, thereby confirming that the observed G+C bias is not merely an artifact of the sequencing technology used (Ragupathy et al., 2011).
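
For readers unfamiliar with the N50 statistic used above, the following Python sketch shows one common way to compute it from a list of contig or scaffold lengths. It is an illustration with made-up lengths, not the code used to produce Table 2, and the function name `nx` is hypothetical.

```python
from typing import List, Tuple


def nx(lengths: List[int], fraction: float = 0.5) -> Tuple[int, int]:
    """Return (Nx length, number of units) for an assembly.

    Units are sorted from longest to shortest and accumulated until the
    running total reaches `fraction` of the total assembly size.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, count
    return ordered[-1], len(ordered)  # not reached for 0 < fraction <= 1


# Toy example (made-up lengths, not flax data).
toy = [100, 80, 60, 40, 20]
print(nx(toy, 0.5))  # (80, 2): 100 + 80 = 180 >= 150
print(nx(toy, 0.9))  # (40, 4): 100 + 80 + 60 + 40 = 280 >= 270
```

Read this way, the statement "N50 = 693.5 kb in 132 scaffolds" means that the 132 longest scaffolds together hold at least half of the scaffold assembly, and the shortest of them is 693.5 kb long.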

Table 2. Statistics for the WGS assembly
Statistic | Contig length (bp) | Contig number | Scaffold length (bp) | Scaffold number
N90 | 2487 | 17 966 | 82 101 | 601
N80 | 7548 | 11 535 | 225 440 | 372
N70 | 11 682 | 8349 | 349 810 | 261
N60 | 15 741 | 6126 | 497 164 | 186
N50 | 20 125 | 4427 | 693 492 | 132
Longest | 151 807 | - | 3 087 368 | -
Total size | 302 170 579 | - | 318 247 816 | -
Total number (≥100 bp) | - | 116 602 | - | 88 384
Total number (≥2 kb) | - | 19 160 | - | 1458

Figure 1. Genome assembly results.
(a) Assembly units (i.e. contigs or scaffolds) were sorted by descending length, and the cumulative DNA content was plotted as a function of the total number of assembly units. The total genome size of 373 Mb (measured by flow cytometry) is represented by a dashed line, and the N50 and N90 assembly units are indicated by filled and open circles, respectively.
(b) Distribution of gap length within the scaffold assembly.
(c) G+C content as a proportion of total bases measured in 500 bp bins throughout the flax (Lus) WGS assembly, compared to genomes of nine other fully sequenced dicots: Arabidopsis thaliana, Cucumis sativus, Glycine max, Ricinus communis, Populus trichocarpa, Malus domestica, Manihot esculenta, Medicago truncatula and Vitis vinifera.
(d) Distribution of the length of intervening sequence in scaffold/BES alignments. The length of scaffold DNA between paired BES that aligned to the same scaffold was calculated, and is plotted as a frequency distribution for each of the pairs of aligned BES.

To assess the accuracy of the de novo assembly, we compared it to multiple independent sources of flax DNA sequence information. First, we obtained the complete set of 286 852 Linum usitatissimum ESTs from the National Center for Biotechnology Information (NCBI). Most of these accessions (93%) were from the same cultivar (CDC Bethune) that was used for WGS sequencing. Using blat software (Kent, 2002), we found that 248 927 of the 286 852 NCBI ESTs (87%) aligned to at least one WGS scaffold, with ≥95% coverage of the EST length and ≥95% sequence identity (Table 3). When ESTs were filtered by length (removing some low-quality sequences), we observed that at least 93% (217 875/234 547) of the NCBI flax ESTs with a length ≥400 bp aligned to the WGS scaffolds. We also aligned the WGS scaffolds with fosmids and BACs that had been sequenced using capillary or pyrosequencing methods, and that represented a total of 1.3 Mb of non-redundant genomic sequence (Table 4 and Figure S2). Within aligned blocks, sequence identity was typically ≥99%. However, globally, only between 81.4% and 99.2% of BAC or fosmid sequences aligned with the WGS scaffolds. Moreover, three BACs aligned best to non-contiguous segments of different scaffolds, indicating that the long-range continuity of the WGS assembly was limited in its accuracy. We further characterized the long-range accuracy of the WGS assembly by comparing the WGS scaffolds to 80 340 previously described BAC end sequences (BES) obtained from paired, Sanger dideoxy sequencing reads of two CDC Bethune BAC libraries (Figure 1d) (Ragupathy et al., 2011). Of the 40 099 BES accessions that aligned to masked WGS scaffolds with high stringency (≥99% sequence identity, ≥400 bp alignment length and ≥95% BES coverage), 9668 were pairs (i.e. from opposite ends of the same BAC) in which both members aligned to the same WGS scaffold. 
In each case, the length of scaffold region between the aligned BES pairs was calculated, and the distribution of these lengths is plotted in Figure 1(d). The median length of scaffold sequence was 133.4 kb, and the mean length was 140 kb. These results are consistent with the 135 and 150 kb mean insert lengths reported for the two original CDC Bethune BAC libraries. However, a minority of alignments (1448/9668; 15%) had intervening scaffold sequence lengths that deviated greatly from the mean (<100 kb or >200 kb), indicating that additional studies are required to improve the long-range accuracy of the assembly.
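
The BES-pair check described above can be sketched as follows. The spans in the example are invented for illustration, and the 100 kb and 200 kb cut-offs are the ones quoted in the text; the function name and return structure are assumptions, not the authors' pipeline.

```python
from statistics import mean, median
from typing import Dict, List


def summarize_spans(spans_kb: List[float],
                    low: float = 100.0,
                    high: float = 200.0) -> Dict[str, float]:
    """Summarize scaffold distances between paired BAC end sequences.

    A span outside [low, high] kb is flagged as inconsistent with the
    expected BAC insert length and may indicate a local mis-assembly.
    """
    outliers = [s for s in spans_kb if s < low or s > high]
    return {
        "median_kb": median(spans_kb),
        "mean_kb": mean(spans_kb),
        "outlier_fraction": len(outliers) / len(spans_kb),
    }


# Invented spans (kb) for illustration only.
toy_spans = [90.0, 120.0, 130.0, 140.0, 150.0, 260.0]
summary = summarize_spans(toy_spans)
print(summary["median_kb"], round(summary["outlier_fraction"], 2))

# Fraction reported in the text: 1448 of 9668 pairs outside 100-200 kb.
print(round(1448 / 9668, 2))  # 0.15
```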

Table 3. Alignment of capillary-sequenced ESTs to WGS scaffolds
EST coverage in alignment (%) | Alignment identity (%) | EST length (bp) | Aligned ESTs (number) | Total ESTs queried (number) | ESTs aligned (%)
≥90 | ≥95 | >0 | 262 748 | 286 852 | 91.6
≥95 | ≥95 | >0 | 248 927 | 286 852 | 86.8
≥90 | ≥95 | ≥400 | 225 579 | 234 547 | 96.2
≥95 | ≥95 | ≥400 | 217 875 | 234 547 | 92.9

Table 4. Alignment of capillary-sequenced BACs and fosmids to WGS scaffolds
BAC/fosmidBAC/fosmid size (kb)Matching scaffoldScaffold size (kb)Alignment (kb)Identity (%)Gaps (%)
JX174446 214.61562281.8214.695.63.1
RL10-contig00001190.5107560190.597.71.5
JX174448 184.9281933.6119.2981.4
23781.864.299.20.6
JX174447 180.11036248.323.5942.5
332127.3156.594.93.3
JX174445 179.1155809.212695.42.2
17793453.1951.7
JX174444 172.5961083.7172.591.85.3
JX174449 130.3931463.5130.393.95.2
HQ902252 39.81486346.539.891.27.7
JN133299 34.6465878.634.681.42.1
JN133300 31.5898104531.598.61.1
JN133301 26.21856140.626.286.76.2

Repetitive elements

To identify various types of middle-repetitive DNA sequences within the CDC Bethune WGS assembly, the complete set of WGS scaffolds was submitted to repeatmasker version 3.3.0 (http://www.repeatmasker.org/), which identified 15.8 Mb of sequence (5.2% of a total of 302 Mb WGS contigs) as having significant similarity to the subjects in RepBase (http://www.girinst.org/repbase/). The set of repeats identified by this homology-based analysis was expanded substantially by application of de novo repeat identification methods, resulting in a total of 73.8 Mb (24.4% of WGS contig assembly) being annotated as sequence with similarity to mobile elements (Table 5). As expected, retroelements were the most common mobile elements found in the genome (20.6% of WGS contigs), and these were represented primarily by LTR-type retroelements (18.4%) followed by long interspersed elements (LINEs) (2.2%). All DNA transposons represented a much smaller proportion of the WGS contig sequences (3.8%), and most were Mutator-type elements (2.1% of WGS contig sequence). The proportions of the various types of mobile elements are similar to what has been reported in other fully sequenced dicot genomes of comparable size. The unusual 5.8 kb LIS-1 insertion sequence (GenBank accession AF104351) that has been described in other varieties of flax was not found in its entirety in the CDC Bethune assembly; the largest LIS-1 fragment detected was 543 bp (Chen et al., 2005).

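Table 5. Transposable elements identified in the flax WGS assembly
Class | Order | Superfamily | Number of elements | Elements percentage (%) | Sequence occupied (bp) | Sequence percentage of transposable elements (%) | Sequence percentage of genome (%)
Retrotransposons | LTR | Copia | 89 951 | 38.31 | 29 594 882 | 40.08 | 9.79
Retrotransposons | LTR | Gypsy | 72 626 | 30.93 | 25 123 127 | 34.02 | 8.31
Retrotransposons | LTR | Unclassified | 2797 | 1.19 | 902 298 | 1.22 | 0.30
Retrotransposons | DIRS | DIRS | 2 | 0.00 | 102 | 0.00 | 0.00
Retrotransposons | PLE | Penelope | 548 | 0.23 | 30 214 | 0.04 | 0.01
Retrotransposons | LINE | RTE | 11 | 0.00 | 618 | 0.00 | 0.00
Retrotransposons | LINE | L1 | 27 632 | 11.77 | 6 684 243 | 9.05 | 2.21
Retrotransposons | SINE | Unclassified | 1 | 0.00 | 49 | 0.00 | 0.00
DNA transposons | TIR | Tc1-Mariner | 191 | 0.08 | 38 231 | 0.05 | 0.01
DNA transposons | TIR | hAT | 7935 | 3.38 | 1 986 522 | 2.69 | 0.66
DNA transposons | TIR | Mutator | 21 124 | 9.00 | 6 320 424 | 8.56 | 2.09
DNA transposons | TIR | P | 2 | 0.00 | 96 | 0.00 | 0.00
DNA transposons | TIR | Harbinger | 1384 | 0.59 | 344 876 | 0.47 | 0.11
DNA transposons | TIR | En-Spm/CACTA | 8330 | 3.55 | 2 372 592 | 3.21 | 0.79
DNA transposons | Helitron | Helitron | 2154 | 0.92 | 434 859 | 0.59 | 0.14
Unclassified | Unclassified | Unclassified | 95 | 0.04 | 8981 | 0.01 | 0.00
Total | - | - | 234 783 | 100.00 | 73 842 114 | 100.00 | 24.44
LTR, Long Terminal Repeat; DIRS, Dictyostelium Intermediate Repeat Sequence; PLE, Penelope; LINE, Long Interspersed Nuclear Element; SINE, Short Interspersed Nuclear Element; TIR, Terminal Inverted Repeat.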
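
As a cross-check, the class-level percentages quoted in the text (20.6% retroelements, 3.8% DNA transposons, 24.4% in total) can be recomputed from the per-superfamily base counts in Table 5. The sketch below assumes that the denominator is the total contig size from Table 2 and that Helitrons are counted with the DNA transposons; the variable names are illustrative.

```python
# Sequence occupied (bp) per superfamily, copied from Table 5.
RETRO_BP = {
    "Copia": 29_594_882, "Gypsy": 25_123_127, "LTR unclassified": 902_298,
    "DIRS": 102, "Penelope": 30_214, "RTE": 618, "L1": 6_684_243,
    "SINE unclassified": 49,
}
DNA_BP = {
    "Tc1-Mariner": 38_231, "hAT": 1_986_522, "Mutator": 6_320_424, "P": 96,
    "Harbinger": 344_876, "En-Spm/CACTA": 2_372_592, "Helitron": 434_859,
}
UNCLASSIFIED_BP = 8_981
CONTIG_TOTAL_BP = 302_170_579  # assumed denominator: total contig size (Table 2)


def percent_of_contigs(bp: int) -> float:
    """Express a number of bases as a percentage of the contig assembly."""
    return 100.0 * bp / CONTIG_TOTAL_BP


retro_pct = percent_of_contigs(sum(RETRO_BP.values()))
dna_pct = percent_of_contigs(sum(DNA_BP.values()))
total_bp = sum(RETRO_BP.values()) + sum(DNA_BP.values()) + UNCLASSIFIED_BP
total_pct = percent_of_contigs(total_bp)
print(round(retro_pct, 1), round(dna_pct, 1), round(total_pct, 1))
```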

Non-coding RNAs

Non-coding RNAs comprise the majority of cellular RNAs and play an important role in translation (tRNA and rRNA), synthesis of the translational apparatus (snRNA), and gene regulation (miRNA). We searched the WGS assembly for sequences characteristic of these four types of non-coding RNAs (Table 6) (Griffiths-Jones et al., 2003; Nawrocki et al., 2009). At least 297 putative miRNA precursor loci with similarity to known miRNAs were identified (Table S1), and more than 1000 copies were found of both tRNAs and 5S RNAs. These frequencies are similar to what has been reported in other species of comparable genome size. However, an unexpectedly low number of 45S RNAs were identified, with as few as 11 copies of the 5.8S RNA component of the 45S locus found in the WGS assembly.

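Table 6. Non-coding RNA species identified in the flax WGS assembly
Type | Sub-type | Copy number | Mean length (bp) | Total length (bp) | Percentage of genome
miRNA | - | 297 | 120.49 | 35 785 | 0.01
tRNA | - | 1100 | 74.75 | 82 224 | 0.03
rRNA | - | 1100 | 90.99 | 100 088 | 0.03
rRNA | 18S | 65 | 286.97 | 18 653 | 0.00
rRNA | 28S | 14 | 121.43 | 1700 | 0.00
rRNA | 5.8S | 11 | 123.18 | 1355 | 0.00
rRNA | 5S | 1010 | 77.60 | 78 380 | 0.02
snRNA | - | 462 | 120.05 | 55 462 | 0.02
snRNA | C/D box | 264 | 102.57 | 27 079 | 0.01
snRNA | H/ACA box | 41 | 119.85 | 4914 | 0.00
snRNA | Splicing | 157 | 149.48 | 23 469 | 0.01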

Plastid sequences

We searched the WGS assembly for fragments with similarity to plastid DNA. We identified 31 scaffolds that consisted primarily of chloroplast DNA, but we did not detect a complete assembly of the plastid genome among the flax WGS scaffolds. However, many occurrences of plastid-derived sequences were observed within the scaffolds of nuclear DNA. Using BLASTN (e-value ≤10⁻⁴) (Altschul et al., 1997), we compared the WGS assembly to a draft of the L. usitatissimum chloroplast genome that was independently sequenced (J. McDill, unpublished results). This search revealed 1356 regions with significant similarity to flax chloroplast DNA, ranging from 23 to 5899 bp in length in 416 scaffolds. In total, these regions represented an approximately 89% coverage of the draft flax chloroplast genome (including only a single copy of the chloroplast inverted repeat region). The size distribution of these regions conforms to the distributions of NUPT (nuclear plastid DNA) observed in other species (Figure S3) (Richly and Leister, 2004).
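
Because many of the matching regions overlap on the chloroplast reference, a coverage figure such as the 89% above is obtained from the union of the hit intervals rather than from the sum of their lengths. The Python sketch below illustrates that calculation with invented coordinates; it is not the authors' script, and the names are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end) on the reference


def covered_bases(hits: List[Interval]) -> int:
    """Count reference positions covered by at least one hit."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(hits):
        if current_end is None or start > current_end + 1:
            # Disjoint from the block being built: close it and start anew.
            if current_end is not None:
                total += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent: extend the current block.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start + 1
    return total


def coverage_fraction(hits: List[Interval], reference_length: int) -> float:
    """Fraction of the reference covered; count the inverted repeat once."""
    return covered_bases(hits) / reference_length


# Invented hit coordinates on a 1000 bp toy reference.
toy_hits = [(1, 100), (50, 150), (300, 400), (401, 450)]
print(covered_bases(toy_hits))            # 301
print(coverage_fraction(toy_hits, 1000))  # 0.301
```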

Gene prediction

To predict the locations of protein-coding genes within the repeat-masked WGS assembly, we used two de novo hidden Markov model-based gene-finding programs: GlimmerHMM and Augustus (Majoros et al., 2004; Stanke et al., 2008). The output of these programs was integrated with WGS alignments of flax ESTs and other empirically established plant transcript sequences to produce a consensus gene set using glean (Elsik et al., 2007). The consensus gene set contained a total of 43 484 genes, with one transcript model per gene. The mean and median gene lengths [coding sequences (CDS), no introns] of the consensus gene set were 1200 and 987 bp, respectively. A total of 40 971 genes had a predicted length ≥300 bp. The distributions of CDS, exon and intron lengths are shown in Figure 2 for the predicted flax genes and genes from five other representative genomes.

Figure 2. Gene prediction.
Distribution of the length of (a) mRNA, (b) CDS, (c) exon and (d) intron features within the complete set of predicted genes for flax and five other representative dicots.

To assess the coverage and accuracy of the gene prediction, we compared the filtered set of 43 484 predicted flax genes (which excluded most transposable elements) to the complete set of flax ESTs from the NCBI databases (Table 7). We aligned the CDS regions of the predicted genes to the 286 852 NCBI EST sequences using BLASTN. We observed that 248 210/286 852 NCBI flax ESTs (86.5%) aligned to one or more WGS CDS (BLASTN e-value ≤10⁻²⁰). Because many ESTs potentially included UTRs, which are not part of CDS datasets, we repeated the comparison, using only the 239 220 NCBI EST accessions that were ≥400 bp long. Of the length-filtered NCBI flax ESTs, 222 846/239 220 (93.2%) aligned to one or more CDS from the predicted flax WGS proteome (BLASTN e-value ≤10⁻²⁰) (Table 7). Therefore, the coverage of ESTs within the genes predicted from the WGS assembly was very high. Conversely, 28 783/43 484 CDS sequences (66%) aligned to one or more NCBI ESTs (BLASTN e-value ≤10⁻²⁰) (Table 7). The relatively low proportion of predicted WGS CDS sequences that aligned to flax ESTs may indicate that the coverage of flax genes in EST databases is incomplete and/or that a considerable proportion of the predicted CDS sequences do not represent components of real transcripts. Both possibilities were examined in the subsequent analyses described below.

Table 7. Alignment of predicted WGS CDS with ESTs, predicted genes and domains of other species
Query | Query (number) | Subject | Algorithm | Parameters | Queries with matches | Percentage
NCBI Lus ESTs | 286 852 | WGS CDS | BLASTN | 1.00E-20 | 248 210 | 86.5
NCBI Lus ESTs | 239 220 | WGS CDS | BLASTN | 1E-20; length >400 | 222 846 | 93.2
WGS CDS | 43 484 | NCBI Lus ESTs | BLASTN | 1.00E-20 | 28 783 | 66.2
WGS CDS | 43 484 | NCBI nr | BLASTP | 1.00E-05 | 38 920 | 89.5
WGS CDS | 43 484 | Ptr CDS | BLASTP | 1.00E-05 | 39 746 | 91.4
WGS CDS | 43 484 | Ath CDS | BLASTP | 1.00E-05 | 38 876 | 89.4
Ptr CDS | 40 688 | WGS CDS | BLASTP | 1.00E-05 | 35 135 | 86.4
Ath CDS | 27 416 | WGS CDS | BLASTP | 1.00E-05 | 23 575 | 86.0
WGS CDS | 43 484 | Pfam-A | HMMer3 | 1.00E-05 | 33 459 | 77.0
Lus, Linum usitatissimum; Ptr, Populus trichocarpa; Ath, Arabidopsis thaliana.

We compared the predicted WGS proteins with peptide sequences in the NCBI databases, and separately with peptide sequences from the well-characterized Arabidopsis and poplar genomes (Tuskan et al., 2006; Swarbreck et al., 2008). We observed that 89.5% (38 920/43 484) of flax WGS proteins aligned to one or more proteins from the NCBI nr protein database (http://www.ncbi.nlm.nih.gov/protein), and nearly the same proportion aligned with Arabidopsis (38 876/43 484; 89.4%) or poplar proteins (39 746/43 484; 91.4%; BLASTP e-value ≤10⁻⁵). Because sequences from different species were not expected to be identical, a lower level of stringency was used in these cross-species comparisons than was used in the comparisons of flax ESTs to flax CDS sequences. Overall, the high proportion of predicted proteins from the flax WGS that aligned with proteins from Arabidopsis, poplar and the NCBI nr database indicates that the majority of genes predicted in the flax WGS are legitimate protein-coding sequences. We also made converse comparisons to determine what proportion of Arabidopsis and poplar proteins were represented by at least one flax protein. We found that 86.0% (23 575/27 416) of Arabidopsis proteins and 86.4% (35 135/40 688) of poplar proteins aligned to one or more predicted flax proteins (BLASTP e-value ≤10⁻⁵). This level of coverage is consistent with the alignment of predicted poplar and Arabidopsis proteins reported after sequencing of the poplar genome (Tuskan et al., 2006).

Functional annotation of predicted proteins

The Pfam-A database provides profile hidden Markov models of 13 672 conserved protein families (Finn et al., 2010). Approximately 79% of known proteins contain one or more Pfam domains. We used PfamScan/HMMer3 to identify Pfam domains within the predicted genes of the flax WGS assembly (ftp://ftp.sanger.ac.uk/pub/databases/Pfam/Tools/; Eddy, 2011). As shown in Table 7, 77% (33 459/43 484) of the proteins in the flax WGS assembly contained one or more Pfam domains (Pfam-A 26.0, e-value ≤10⁻⁵). Almost all of the 5312 Pfam-A domains were found to occur at the same frequency in the predicted flax genes as in genes from either Arabidopsis or poplar (Figure 3). Only 47 domains showed significantly different abundance in a comparison between Arabidopsis and flax, and 53 domains were different in abundance between flax and poplar (χ² adjusted FDR q-value <0.05; Table S2). Among these, 12 domains (excluding domains assumed to represent unmasked transposable elements) were significantly more abundant in flax than in at least one of either Arabidopsis or poplar (Figure 4). Some of the notable Pfam domains that were enriched in flax included HSP70 (heat shock protein 70, PF00012), GRAS transcription factors (PF03514) and BSP (basic secretory protein, PF04450); BSP is found in peptidases involved in defense. Agglutinin domains (PF07468), which are also presumably involved in defense, were found in 17 predicted flax genes. This agglutinin domain is not found in any known proteins of either Arabidopsis or poplar (although it is present in a diverse range of other species; Figure 4). Conversely, domains that were under-represented in the predicted flax genes included many domains of unknown function (DUFs), some classes of leucine-rich repeats, wall-associated kinases, F-box-associated domains and transcription factors. The functional and ecological relevance of these observations is currently being tested.
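
One plausible way to carry out a per-domain comparison of this kind is a 2 × 2 χ² test for each domain followed by a Benjamini-Hochberg correction. The Python sketch below illustrates that approach under those assumptions; the exact test settings used by the authors are not given in the text, and the function names are hypothetical. The agglutinin counts (17 of 43 484 flax genes, none of 27 416 Arabidopsis genes) come from the text and Table 7.

```python
import math
from typing import List


def chi2_2x2_pvalue(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square test (1 d.f., no continuity correction) on the
    table [[a, b], [c, d]]; returns the p-value."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 1.0  # test undefined; treat as no evidence of a difference
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of the chi-square distribution with 1 d.f.
    return math.erfc(math.sqrt(chi2 / 2.0))


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Return BH-adjusted q-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from largest to smallest p-value
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


# Genes with / without the agglutinin domain in flax versus Arabidopsis.
p_agglutinin = chi2_2x2_pvalue(17, 43_484 - 17, 0, 27_416)
print(p_agglutinin < 0.01)  # True

# In practice one p-value per domain would be adjusted together.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.20]))
```

With very small expected counts, an exact test (for example Fisher's exact test) may be preferable to the χ² approximation.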

Figure 3. Abundance of Pfam-A domains in predicted proteins from flax and other whole-genome sequences.
For each domain, the number of proteins with at least one occurrence of the domain is plotted. Domains that are significantly more abundant in one species (χ² FDR q-value <0.05) are plotted as triangles; domains with similar abundance in each species (χ² FDR q-value ≥0.05) are plotted as circles. Comparison of (a) flax with A. thaliana and (b) flax with P. trichocarpa. Full details of the domains and their frequencies are given in Table S2.

Figure 4. Pfam-A domain frequency in flax and other species.
All domains that were more abundant (χ² FDR <0.05) among predicted proteins of flax compared to either A. thaliana or P. trichocarpa are shown (labels in bold). A subset of the domains that were significantly less abundant in flax compared to either species is also shown. Additional species for which whole-genome sequence is available are shown for comparison. The width of the colored region indicates the number of genes containing a given Pfam-A domain within each species. A different scale is used for each domain, and the number to the right of each bar indicates the total number of genes represented by that bar. Redundant occurrences of the same domain within the same gene are counted only once. BSP, basic secretory protein; Barwin, PF00967; alginate lyase, PF08787; agglutinin, PF07468; Self_Incomp_S1, plant self-incompatibility protein S1 (PF05938); SCRL, plant self-incompatibility response protein (PF06876); HSP70, heat shock protein 70 (PF00012); GRAS, GRAS family transcription factor (PF03514); transferase, PF02458; DUF4409, domain of unknown function (PF14365); BBE, berberine and berberine-like (PF08031); PPR_2, PPR repeat family (PF13041); Oxidored_q1, oxidoreductase (PF00361); MP, viral movement protein (PF01107); DUF659 (PF04937); DUF577 (PF04510); DUF4371 (PF14291); DUF4220 (PF13968). Ath, Arabidopsis thaliana; Bdi, Brachypodium distachyon; Cpa, Carica papaya; Cre, Chlamydomonas reinhardtii; Csa, Cucumis sativus; Gma, Glycine max; Lus, Linum usitatissimum; Mdo, Malus domestica; Mes, Manihot esculenta; Osa, Oryza sativa; Ppa, Physcomitrella patens; Ptr, Populus trichocarpa; Rco, Ricinus communis; Sbi, Sorghum bicolor; Vvi, Vitis vinifera; Zma, Zea mays.

Analysis of sequence divergence in duplicated genes

Because the number of genes predicted in the flax WGS assembly was higher than in some other plant species (e.g. A. thaliana), we analyzed the gene set for evidence of recent genome duplication. We performed two separate analyses, using as input either the capillary-sequenced flax ESTs from the NCBI GenBank database (http://www.ncbi.nlm.nih.gov/nucleotide/) or the predicted CDS sequences from the WGS assembly. We identified 3644 duplicate gene pairs among the flax ESTs and 9920 pairs within the WGS CDS sequences, and analyzed their divergence as nucleotide substitutions per synonymous site (Ks). In both datasets, a distinct peak in the Ks distribution was observed at Ks = 0.15 (Figure 5). Based on this modal Ks value, a time of divergence of 5–9 MYA was inferred depending on whether a synonymous mutation rate per base per year of 1.5 × 10⁻⁸ or 8.1 × 10⁻⁹ is assumed (Koch et al., 2000; Lynch and Conery, 2000).
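
The 5–9 MYA range can be reproduced with the standard relationship T = Ks / (2r), where r is the synonymous substitution rate per site per year and the factor of 2 accounts for divergence along both duplicate lineages. The short Python sketch below assumes that this is the formula applied; the function name is illustrative.

```python
def divergence_time_mya(ks: float, rate_per_site_per_year: float) -> float:
    """Estimate the age of a duplication (million years) as Ks / (2r)."""
    return ks / (2.0 * rate_per_site_per_year) / 1e6


modal_ks = 0.15                                 # peak of the Ks distribution
fast = divergence_time_mya(modal_ks, 1.5e-8)    # about 5.0 MYA
slow = divergence_time_mya(modal_ks, 8.1e-9)    # about 9.3 MYA
print(round(fast, 1), round(slow, 1))
```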

Figure 5. Substitutions per synonymous site (Ks) within duplicated gene pairs from (a) capillary-sequenced flax ESTs and (b) predicted CDS sequences from the flax WGS assembly.

Discussion

Assembly quality

The pace of plant genome sequencing is accelerating, aided by advances in instrumentation and software. The clone-by-clone sequencing strategy used to sequence the Arabidopsis genome has been supplanted by whole-genome shotgun (WGS) approaches. WGS de novo assemblies have typically relied on at least some component of long sequencing reads (e.g. from dideoxy or pyrosequencing), rather than the shorter (but less costly) reads produced by Illumina or SOLiD sequencers (Applied Biosystems). However, a WGS de novo assembly of date palm was recently reported that was based entirely on 36–84 bp Illumina reads, with an N50 scaffold size of 30 kb (Al-Dous et al., 2011). Here we describe de novo WGS assembly of flax based exclusively on short reads (Table 1). The descriptive statistics for the flax WGS assembly (Table 2), with scaffold N50 = 693 kb, 85% genome coverage in scaffolds, and coverage of up to 96% of the gene space as defined by ESTs (Table 3), provide further evidence of the utility of the short-read de novo WGS approach in plants. The occurrence of transposable elements (Table 5), NUPTs (Figure S3) and most types of non-coding RNA (ncRNA) (Table 6) was typical of other plant genomes sequenced to date (Richly and Leister, 2004; Tuskan et al., 2006). As a whole, our analyses showed that the coverage and accuracy of the assembly were high, especially at the scale of individual genes (Tables 3 and 7), but with some limitations regarding long-range accuracy of the assembly (Table 4 and Figure S1). Further improvements in assembly accuracy will require the use of additional sources of data such as the physical maps recently published for CDC Bethune (Ragupathy et al., 2011). Nevertheless, we conclude that the short-read de novo WGS approach is a highly efficient strategy for characterization of the nearly complete gene space of flax.

Predicted genes and protein domains

The masked WGS assembly supported prediction of a consensus set of 43 484 genes (Table 7). This gene number is comparable to what has been reported in soybean, poplar, and rice, but is substantially larger than in Arabidopsis, Brachypodium or cucumber (Velasco et al., 2010). The length distribution of exons of predicted flax proteins was consistent with other species, although flax intron and mRNA lengths tended to be shorter than in the other species in the comparison (Figure 2). The proportions of predicted flax genes that aligned with Arabidopsis genes (89.4%), that contain Pfam domains (77%), and that align with one or more proteins in the NCBI nr database (89.5%) are all high (Table 7). Moreover, the distribution of conserved protein domains in flax is nearly equivalent when compared to either Arabidopsis or poplar (Figure 3), and 93.2% of known ESTs of at least 400 bp align with high stringency to the predicted CDS sequences. Together these data support our conclusion that the predicted WGS genes are highly accurate and that the WGS assembly (and subsequent gene prediction) produced very good coverage of the gene space of flax. Therefore, the relatively low coverage of the predicted WGS flax genes by ESTs (66.2%) is probably due primarily to incomplete gene representation in the ESTs, rather than systematic errors in gene prediction.

Only a few functional groups appeared to be significantly over-represented in flax compared to other species (Figure 4 and Table S2). The increased abundance of domains that are often related to plant defense responses (e.g. lectins and basic secretory proteins), and the decreased abundance of leucine-rich repeat domain-containing proteins, raise the possibility of additional biotic stress response strategies in flax. The increased abundance of two domains nominally associated with self-incompatibility (PF05938 and PF06876; Figure 4) is probably not functionally relevant to pollination control in this self-compatible species, as we note that one of these domains (SCRL) is also highly represented in Arabidopsis.

Evolution of the flax genome

Analysis of the rates of divergence of duplicated genes (Figure 5) indicated that a whole-genome duplication event probably occurred 5–9 MYA in the lineage of L. usitatissimum. The same Ks distribution was observed in both CDS sequences predicted from the Illumina WGS assembly (Figure 5b) and capillary-sequenced ESTs (Figure 5a), demonstrating that the distribution is not an artifact of the sequencing or assembly technique used. A recent whole-genome duplication is also consistent with the large gene number predicted for flax, compared for example to Arabidopsis (Table 7), with the relatively high number of duplicated genes (9920) identified, and with the numbers of chromosomes reported in L. usitatissimum and related species (Figure 6). L. usitatissimum and the closely related L. bienne both have n = 15 chromosomes, while sister clades containing species such as L. grandiflorum and L. lewisii have n = 8 and n = 9 chromosomes, respectively. Therefore, the evolution of L. usitatissimum probably involved chromosome doubling followed by loss of one or more chromosomes. Comparison of patterns of gene family expansion in L. usitatissimum and L. bienne and other congeners is ongoing.
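For readers unfamiliar with how a Ks peak is converted into an age, the standard molecular-clock relationship is sketched below. It assumes a constant synonymous substitution rate r (substitutions per synonymous site per year). The symbol r is introduced here for illustration, and the rate underlying the 5–9 MYA estimate is not stated in this section.

```latex
T = \frac{K_s}{2r}
```

The factor of 2 reflects that both copies of a duplicated gene accumulate synonymous substitutions independently after the duplication, so the observed divergence between them is twice what either lineage alone would show.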

Figure 6.

 Phylogram of flax (L. usitatissimum) and related species showing haploid chromosome number (in black or shaded circle).
The star indicates the approximate placement of the whole-genome duplication that we infer occurred in the common ancestor of L. bienne and L. usitatissimum. The tree and dated nodes are based on data from McDill et al. (2009).

In conclusion, de novo assembly of exclusively short, paired reads obtained by approximately 94× raw coverage of the flax genome was an efficient method of obtaining near-complete and highly accurate information about the gene space of this crop. A limitation to this approach is the accuracy of the long-range assembly, as demonstrated by imperfect alignments to BAC sequences. Nevertheless, the gene-scale data are sufficient for further investigations into the function and evolution of this species.

Experimental Procedures

Plant growth and DNA isolation

Seeds of the L. usitatissimum linseed cultivar CDC Bethune were obtained from its breeder, Gordon Rowland (Department of Plant Sciences, University of Saskatchewan, Saskatoon, Canada). These seeds were the F19 generation of the original cross of variety NorMan × genotype FP857 (Rowland et al., 2002). All generations were self-pollinated, and the final eight generations were obtained by single-seed descent. Seeds were surface-sterilized and grown axenically in polycarbonate vessels. DNA was extracted from etiolated seedlings using a variation of a published protocol (Dellaporta et al., 1983), in which the concentration of NaCl in the extraction buffer was 1 M. Upon resuspension of DNA in TE buffer (50 mM Tris and 10 mM EDTA, pH 8), 12 μl of 10 mg/ml RNase A was added, and the mixture was allowed to incubate at 37°C for 1 h. Subsequently a phenol/chloroform/isoamyl alcohol (25:24:1) extraction was performed to remove impurities. DNA was precipitated with 0.1 volumes of 3 M NaOAc and 0.6 volumes of isopropanol, washed in 70% EtOH, resuspended in 400 μl TE (10:0.1), and subsequently re-precipitated with 0.1 volumes of 3 M NaOAc and two volumes 95% EtOH. DNA was washed and resuspended in TE. Aliquots of the same DNA preparation were used for genome sequencing and fosmid library construction.

Flow cytometry

DNA content measurements (Galbraith et al., 1983) were performed on ten CDC Bethune seedlings germinated from the same source of seeds used for WGS and fosmid sequencing. The measurements were performed using an Accuri C6 flow cytometer (Becton, Dickinson, http://www.bdbiosciences.com), with propidium iodide/RNase as the nuclear DNA stain (Bharathan et al., 1994). The nuclear DNA contents were calibrated to an external standard, Raphanus sativus cv. Saxa (Dolezel et al., 2007), with a DNA content of 1.11 pg/2C nucleus. A single preparation of the standard was run five times to provide a mean DNA fluorescence peak position that was then used as the reference for the DNA content calculations.
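The calibration described above amounts to scaling the sample fluorescence peak by the mean peak of the standard. The sketch below illustrates that arithmetic under stated assumptions: the peak positions are hypothetical numbers, the function names are invented for this example, and the conversion of 1 pg to approximately 978 Mb is a widely used factor that is not given in the text.

```python
from statistics import mean

STANDARD_2C_PG = 1.11  # 2C DNA content of the radish standard, from the text
MB_PER_PG = 978.0      # assumption: 1 pg of DNA is roughly 978 Mb


def sample_2c_pg(sample_peak, standard_peak, standard_2c_pg=STANDARD_2C_PG):
    """Scale the standard's 2C value by the ratio of fluorescence peaks."""
    return standard_2c_pg * sample_peak / standard_peak


def haploid_genome_mb(two_c_pg):
    """Convert a 2C DNA content (pg) to a 1C genome size in Mb."""
    return two_c_pg / 2 * MB_PER_PG


# Hypothetical peak positions (arbitrary fluorescence units).
standard_runs = [100000, 101000, 99000, 100500, 99500]  # five runs of the standard
sample_peak = 70000                                      # one seedling

two_c = sample_2c_pg(sample_peak, mean(standard_runs))
print(f"2C = {two_c:.3f} pg, 1C = {haploid_genome_mb(two_c):.0f} Mb")
```

Averaging the five runs of the standard before taking the ratio mirrors the procedure in the text, where a single mean peak position serves as the reference for all ten seedlings.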

Whole-genome shotgun sequencing and assembly

Seven paired-end sequencing libraries were constructed according to the manufacturer’s protocols (Illumina, http://www.illumina.com/) at BGI-Shenzhen. The nominal insert sizes of the libraries were 300 bp, 500 bp, 2 kb, 5 kb and 10 kb. For libraries with an insert size between 200 and 500 bp, library preparation proceeded as follows: genomic DNA fragmentation, end repair, adapter ligation, size selection and PCR. For the longer (≥2 kb) mate-paired libraries, DNA circularization, digestion of linear DNA, fragmentation of circularized DNA, and purification of biotinylated DNA were performed before adapter ligation. After library construction, the template DNA fragments were hybridized to the surface of flow cells, amplified to form clusters, and then sequenced using an Illumina GAII genome analyzer.

Raw reads were filtered to remove fragments that contained >2% unknown bases or poly(A) structure, or for which >60% of bases (large insert-size library data) or >40% of bases (short-insert library data) showed a quality score ≤7. Reads were also removed if more than 10 bp aligned to the adapter sequence (allowing a mismatch of ≤3 bp). The filtered reads were assembled using SOAPdenovo as previously described (Li et al., 2009). The assembly comprised three stages. In the first stage, short-insert library data were split into k-mers, a de Bruijn graph was constructed and simplified, and then the k-mer path was connected to generate the contig file. In the second stage, all the usable reads were re-aligned onto the contig sequences, then the number of shared paired-end relationships between each pair of contigs was calculated and used to identify consistent and conflicting paired-ends used to construct the scaffolds. Finally, the paired-end information was used to retrieve read pairs that had one end mapped to a unique contig and the other located in a gap region, and then a local assembly was performed for these collected reads to fill the gaps.
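The read-level filters described above can be illustrated with a short sketch. This is not the pipeline that was actually run: the function name is hypothetical, "poly(A) structure" is interpreted here as a read consisting entirely of A (an assumption), and the adapter-alignment filter is omitted because it requires an aligner.

```python
def keep_read(seq, quals, short_insert=True, low_q=7):
    """Return True if a read passes the sketched quality filters.

    seq          : read sequence as a string
    quals        : per-base quality scores as integers, same length as seq
    short_insert : True for short-insert libraries, False for large-insert ones
    """
    if not seq or len(seq) != len(quals):
        return False
    n = len(seq)
    bases = seq.upper()

    # Discard reads with more than 2% unknown bases.
    if bases.count("N") / n > 0.02:
        return False

    # Assumption: treat an all-A read as "poly(A) structure".
    if set(bases) == {"A"}:
        return False

    # Discard reads in which too many bases have a quality score <= 7;
    # the permitted fraction depends on the library type.
    max_low_fraction = 0.40 if short_insert else 0.60
    low = sum(1 for q in quals if q <= low_q)
    if low / n > max_low_fraction:
        return False

    return True


# Hypothetical 100 bp read with uniformly high quality.
print(keep_read("ACGT" * 25, [30] * 100))
```

The more permissive threshold for large-insert libraries means that the same read can be rejected as short-insert data yet retained as mate-pair data.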

Fosmid library construction and sequencing

Fosmid and BAC libraries were constructed from flax variety CDC Bethune, and selected clones were sequenced as described previously (Ragupathy et al., 2011; Roach et al., 2011). Fosmids were deposited in NCBI GenBank under accessions HQ902252 and JN133299–JN133301.

Annotation of transposable elements and ncRNA

Putative transposable element regions of the flax genome were identified using RepeatMasker version 3.3.0 (http://www.repeatmasker.org/). RMBlast was used as the search algorithm with a Smith–Waterman cut-off of 225 (this cut-off was used for all RepeatMasker analyses), and a database with annotated de novo repeats was combined with the Viridiplantae database of transposable elements (update 20110920) for use as a library for comparison to the shotgun assembly. To annotate the masked bases in their respective transposable element superfamilies, a custom Perl script was used (kindly provided by Robert Hubley, Institute for Systems Biology, Seattle, WA). De novo identification of transposable elements was performed using RepeatScout (Price et al., 2005), PILER (Edgar and Myers, 2005), LTR_finder (Xu and Wang, 2007) and LTR_STRUC (McCarthy and McDonald, 2003). RepeatScout was used under the default parameters. Repeats identified by RepeatScout were filtered for low complexity using tandem repeats finder (Benson, 1999) and nseg (Wootton and Federhen, 1993), and the filtered library was used to mask the flax genome. Repeats with fewer than ten hits in the genome were eliminated from the library. For PILER-DF analysis, the full genome was compared to itself using pals (part of the PILER implementation) with default parameters. Families of dispersed repeats were created using a minimum family size of three members and a maximum length difference of 5% between all family members. The consensus sequence for each family was created after aligning the sequences using muscle (Edgar, 2004). LTR transposable elements were found using LTR_finder with option -w 2 to obtain a table output that was parsed to obtain the sequences corresponding to the elements. LTR_STRUC (McCarthy and McDonald, 2003) was used under default parameters.

Four types of non-coding RNAs were detected by searching the unmasked scaffold assemblies using tRNAscan-SE (version 1.23) (Lowe and Eddy, 1997), and BLAST alignment against the Rfam database (release 9.1) (Griffiths-Jones et al., 2003) and Arabidopsis thaliana and Oryza sativa full-length rRNAs, followed by infernal (version 0.81) (Nawrocki et al., 2009).

Gene prediction and annotation

Gene prediction was performed using a combination of de novo and homology-based methods. De novo predictions were performed on the repeat-masked WGS assembly using augustus (version 2.5.5) (Stanke et al., 2008) and GlimmerHMM (version 3.0.1) (Majoros et al., 2004). Flax ESTs were also aligned to the WGS assembly using blat (blat-34, identity ≥0.98, coverage ≥0.98) to generate spliced alignments, which were linked according to their overlap using pasa (Haas et al., 2003). Plant proteins of other species were also mapped to the WGS assembly using tblastn (Blastall 2.2.23, Altschul et al., 1997) with an e-value cut-off of 10^-5. The aligned sequences as well as their query proteins were then filtered and passed to genewise (version 2.2) (Birney et al., 2004). Gene models from de novo predictions and EST and protein alignments were integrated using glean (Elsik et al., 2007) to produce a consensus gene set.

Duplicate gene pair analysis

All analyses were performed independently on two datasets. First, we downloaded all 286 852 ESTs available for L. usitatissimum in November 2011 from the NCBI database (http://www.ncbi.nlm.nih.gov/nucleotide/) and assembled them into unigenes using cap3 with default settings (Huang and Madan, 1999). Second, we used all 43 484 predicted CDSs from the flax genome project.

Analyses were performed with the DupPipe pipeline (Barker et al., 2008) using default parameters. In short, sequences were clustered into gene family members by parsing the results of a discontiguous MegaBlast (Zhang et al., 2000; Ma et al., 2002) to retain those that showed at least 40% sequence similarity over a minimum of 300 bp. Reading frames for each sequence pair were identified by comparison with all available plant protein sequences of GenBank (Wheeler et al., 2007) using BlastX (Altschul et al., 1997). Best-hit proteins were then paired with each nucleotide sequence at a minimum cut-off of 30% sequence similarity over at least 150 sites. Sequences that did not meet these criteria were removed from further analyses. Each gene was aligned against its best-hit protein using genewise 2.2.2 (Birney et al., 2004) to determine the reading frame. Amino acid sequences for each gene were estimated by using the highest scoring Genewise DNA–protein alignments. Amino acid sequences for each duplicate pair were then aligned using muscle 3.6 (Edgar, 2004), and used to align their corresponding DNA sequences using RevTrans 1.4 (Wernersson and Pedersen, 2003). Ks values (synonymous substitution rates) for each duplicate pair were calculated using the maximum-likelihood method implemented in codeml of the paml package (Yang, 1997) under the F3×4 model (Goldman and Yang, 1994). Further cleaning of the dataset was then performed to remove duplication events that may bias the results. To reduce the possibility that identical genes were represented in the dataset, but were missed by TGICL clustering (Pertea et al., 2003) due to alternative splicing, all Ks values from one member of a duplicate pair with Ks = 0 were removed. Further, to reduce the multiplicative effects of multi-copy gene families on Ks values, simple hierarchical clustering was used to construct phylogenies for each gene family, identified as single-linked clusters, and the nodal Ks values were calculated.
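The grouping of sequences into gene families as single-linked clusters can be sketched as follows. This is an illustration of single-linkage grouping in general, not the DupPipe code: the function names and the tuple layout of the hits are assumptions, and the subsequent phylogeny construction and nodal Ks calculation are not shown.

```python
def passing_pairs(hits, min_identity=40.0, min_length=300):
    """Keep pairwise hits meeting the similarity and length cut-offs in the text.

    hits: iterable of (gene_a, gene_b, percent_identity, alignment_length).
    """
    return [
        (a, b)
        for a, b, identity, length in hits
        if a != b and identity >= min_identity and length >= min_length
    ]


def single_linkage_families(pairs):
    """Group genes so that any chain of pairwise hits joins one family."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    families = {}
    for gene in list(parent):
        families.setdefault(find(gene), set()).add(gene)
    return sorted(sorted(members) for members in families.values())


# Hypothetical hits: (gene_a, gene_b, percent identity, alignment length in bp).
hits = [
    ("g1", "g2", 85.0, 900),
    ("g2", "g3", 62.0, 450),
    ("g4", "g5", 91.0, 1200),
    ("g1", "g6", 35.0, 800),   # fails the 40% similarity cut-off
    ("g4", "g7", 95.0, 120),   # fails the 300 bp length cut-off
]
print(single_linkage_families(passing_pairs(hits)))
```

Under single linkage, g1 and g3 end up in the same family even though they have no direct hit, because both are linked to g2. This is why large families can contribute many pairwise Ks values and why nodal values are used to limit that multiplicative effect.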

Data access

WGS assembly data were deposited at NCBI GenBank under GenomeProject ID #68161 and Sequence Read Archive accession SRA038451. Fosmid and BAC sequences were submitted to GenBank as accessions HQ902252, JN133299–JN133301 and JX174444–JX174449. Additional resources, bulk data downloads and a Gbrowse (Stein et al., 2002) implementation are available at http://www.linum.ca and http://phytozome.org.

Acknowledgements

Funding was provided by Genome Alberta/Genome Canada, the Government of Alberta, Alberta Innovates Technology Futures-iCORE, Institut National de la Recherche Agronomique France and the Indian Council of Agricultural Research.
