
Large-scale resource development in Gossypium hirsutum L. by 454 sequencing of genic-enriched libraries from six diverse genotypes


Summary

Sequence information has proved to be an essential genomic resource for the genetic improvement and better utilization of crop plants. To dissect the Gossypium hirsutum genome for large-scale development of genomic resources, we adopted a hypomethylated restriction-based genomic enrichment strategy to sequence six diverse genotypes. Approximately 5.2 Gb of data (more than 18.36 million reads) was generated which, after assembly, represents nearly 1.27 Gb of genomic sequence. We predicted a total of 93 363 gene models (21 399 full length) and identified 35 923 gene models that were validated against already sequenced plant genomes. A total of 1093 transcription factor-encoding genes, 3135 promoter sequences and 78 miRNAs (including 17 newly identified in Gossypium) were predicted. We identified a significant number of molecular markers, including 47 093 novel simple sequence repeats and 66 364 novel single nucleotide polymorphisms. In addition, we developed the NBRI-Comprehensive Cotton Genomics database, a web resource that provides access to cotton-related genomic resources developed at NBRI. This study contributes a considerable amount of genomic resources and suggests a potential role for genic-enriched sequencing in genomic resource development for orphan crop plants.

Introduction

Cotton, being the most important fibre crop, has been the subject of great interest for its role as a model system in cell development, polyploidization and genomic integration studies (Kim and Triplett, 2001). The genus Gossypium is composed of approximately 45 diploid and 5 tetraploid species. Of these 50 species, only four are cultivated: Gossypium hirsutum, G. barbadense, G. arboreum and G. herbaceum. Most of the commercially cultivated cotton varieties are forms of G. hirsutum (an allotetraploid), which contributes more than 95% of cotton crop production worldwide (National Cotton Council, http://www.cotton.org/, 2011).

The Gossypium genome sequence may provide greater insights for crop improvement through molecular breeding and genetic engineering programmes. Considerable efforts have been made towards the dissection of cotton genomes (diploid as well as tetraploid) through different approaches. Presently, a total of 464 595 Gossypium EST sequences have been deposited in the NCBI EST database (as of 15 April 2013), including 297 522 ESTs from G. hirsutum. Genomic sequences of G. raimondii have been made publicly available by Monsanto (NCBI Bioproject-PRJNA53291) and JGI-DOE (NCBI Bioproject-PRJNA171262). A recent draft genome of G. raimondii has also been reported (Wang et al., 2012). In the case of G. hirsutum, whole genome sequences (unassembled) have been submitted to the public domain (NCBI-SRS375727), but a draft genome is still unavailable.

As most of the repetitive elements in the genome are hypermethylated at 5′-CG-3′ and 5′-CNG-3′ residues (Bennetzen et al., 1994; Gruenbaum et al., 1981; Rabinowicz et al., 1999; Vongs et al., 1993) compared to the genic regions, enrichment for hypomethylated (low copy) DNA has an edge over EST sequencing in capturing these genomic elements. Methylation filtration and hypomethylated partial restriction (HMPR) techniques have been successfully utilized for genic enrichment in Sorghum (Bedell et al., 2005), Ricinus (Foster et al., 2010; Rivarola et al., 2011) and maize (Emberton et al., 2005; Gore et al., 2009).

Although the genomic resources for cotton are rapidly increasing with time, the absence of a complete genome sequence for any of the cultivated Gossypium species makes it difficult to explore its genomic attributes. The present study aims to explore the genic and regulatory regions and to develop genomic resources through sequencing of genic-enriched restriction libraries of six diverse G. hirsutum germplasms. We adopted a modified HMPR approach for genic enrichment in the cotton genome, and the genic-enriched sequencing data were mined for various genomic resources such as genes, transcription factors, promoters, noncoding RNAs and molecular markers. The outcome of the study is a broader representation of genic and regulatory resources in the tetraploid G. hirsutum genome. The data will enrich cotton genomic resources and may help in cotton improvement through marker-assisted breeding and transgenic approaches.

Results

Sequencing and assembly

We selected six diverse G. hirsutum genotypes (JKC703, JKC725, JKC737, JKC770, MCU5 and LRA5166; Table S1) on the basis of their genetic diversity as observed in amplified fragment length polymorphism (AFLP) analysis (Jena et al., 2011). These genotypes are commonly used in breeding programmes for varietal and mapping population development, and hence, markers identified in them, such as single nucleotide polymorphisms (SNPs) and simple sequence repeats (SSRs), will be directly applicable.

Two individual libraries enriched for nonrepetitive DNA (using ClaI fragments and HpaII fragments, Figure S1) were made from genomic DNA of each of the genotypes. After the sequencing of all six genotypes, a total of 18 368 939 genomic reads representing 5 298 872 511 bases (~5.29 Gb) were generated (Table 1). Read lengths ranged from 40 to 1196 bases. A total of 3 680 762 plastid reads, 717 694 mitochondrial reads and 244 023 short reads (<50 bases) were removed.

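Table 1. Sequencing output and quality filtration statistics of the twelve libraries used for sequence generation

| Genotype | Enzyme used | Total reads | Total bases (Mb) | Avg. read length (b) | Longest read (b) | Filtered reads: Chl.a | Filtered reads: Mito.b | Filtered reads: too short | Enriched reads |
|---|---|---|---|---|---|---|---|---|---|
| JKC703 | HpaII | 1509044 | 429.1 | 284 | 634 | 389756 | 89442 | 23212 | 1075626 |
| JKC703 | ClaI | 1647001 | 474.5 | 288 | 630 | 252473 | 41234 | 20952 | 1364181 |
| JKC725 | HpaII | 1360299 | 372.9 | 274 | 704 | 270277 | 62720 | 20964 | 1054157 |
| JKC725 | ClaI | 1865501 | 533.4 | 286 | 976 | 205250 | 41793 | 23849 | 1624696 |
| JKC737 | HpaII | 1448726 | 407.7 | 281 | 806 | 380981 | 69477 | 20997 | 1037133 |
| JKC737 | ClaI | 1688291 | 542.9 | 322 | 672 | 125893 | 31990 | 11190 | 1535749 |
| JKC770 | HpaII | 1304640 | 376.1 | 288 | 1195 | 294435 | 67969 | 22322 | 971266 |
| JKC770 | ClaI | 1765469 | 481.1 | 273 | 1182 | 217653 | 49506 | 13227 | 1518511 |
| MCU5 | HpaII | 1153230 | 316.8 | 275 | 1196 | 423654 | 86618 | 27300 | 688989 |
| MCU5 | ClaI | 1474521 | 416.6 | 283 | 1188 | 485167 | 60026 | 29095 | 948433 |
| LRA5166 | HpaII | 1449002 | 433.9 | 299 | 645 | 258726 | 45684 | 17378 | 1164687 |
| LRA5166 | ClaI | 1703215 | 513.3 | 301 | 627 | 376497 | 71235 | 13537 | 1297097 |
| Total | | 18368939 | 5298.8 | 288 | 1196 | 3680762 | 717694 | 244023 | 14280525 |

a Chloroplast.
b Mitochondrial.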

Finally, a total of 14 280 525 reads, representing 4 168 050 824 bases with more than 93% Q40plus bases, were used for further analysis. These quality-filtered ClaI and HpaII reads from all the genotypes were assembled (Table 2; Table S2) using 454 gsAssembler v2.5.3 to generate 4 095 128 sequences (1.27 Gb), including 533 271 contigs and 3 561 857 singletons. The average contig size after assembly was 900 bp, and the observed N50 value was 894 bases. A total of 3 786 780 sequences (533 271 contigs and 3 253 509 singletons) with a length of 100 bp or more were taken up for further analysis.
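
The N50 value quoted above can be computed directly from a list of contig lengths. The following is a minimal sketch of the standard N50 definition, not the gsAssembler implementation; the function name and the toy lengths are hypothetical and not taken from the study.

```python
def n50(lengths):
    """Return the N50: the contig length L such that contigs of
    length >= L together cover at least half of the assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contig lengths supplied")


# Toy contig lengths (hypothetical, for illustration only)
toy_lengths = [100, 200, 300, 400, 500]
toy_n50 = n50(toy_lengths)
print(toy_n50)
```

An N50 close to the mean contig size, as reported here (894 vs 900 bases), suggests a fairly narrow contig length distribution.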

Table 2. Statistics of resources developed from assembled data of Gossypium hirsutum

| Resources developed | Value |
|---|---|
| Total sequences after assembly | 4095128 |
| Total bases covered (Mb) | 1272 |
| All contigs (≥100 bp) | 533271 |
| Average contig size (bp) | 900 |
| N50 contig size (bp) | 894 |
| GC content | 37.76% |
| Repetitive content identified | 12.16% |
| Common gene modelsa (≥100 bp) | 93363 |
| Nonredundant gene models (validated) | 35923 |
| Complete gene models | 21399 |
| Transcription factor-coding genes | 1093 |
| Promoter sequences (≥500 bp) | 3135 |
| Total rRNA fragments | 587 |
| Total tRNA genes | 1332 |
| Total miRNA genes | 78 |
| Novel miRNA genes | 17 |
| Total SSRs identified | 148930 |
| Number of SSRs with primer designed | 57775 |
| Number of novel SSRs | 47093 |
| Number of nonredundant SNPs | 66444 |
| Number of novel SNPs | 66364 |

a Supported by at least two of the three gene prediction tools.

We compared our data with the G. raimondii genome, which is similar to the D-subgenome of the allopolyploid AD-genome. Two versions of the G. raimondii genome are available: the DOE-JGI version (http://www.jgi.doe.gov/sequencing) and the draft genome (Wang et al., 2012). A total of 1 253 276 and 1 250 825 sequences (contigs and singletons) were mapped against the JGI sequences and the draft genome, respectively (Figure S2). These mapped sequences represent approximately 448 Mb of the D-genome.

Genotypic and enzyme-specific variations

To determine genotype-specific biases in the targeted region of the genome, we individually assembled all the reads obtained from ClaI and HpaII fragments for each genotype (Table S2). The major variation among genotypes at the assembly level was that MCU5 yielded fewer and shorter contigs (owing to its low sequencing output). Further, we made all 15 possible pairwise combinations of the six genotypes and mapped their reads to assess the variation in the targeted (sequenced) region (Figure S3A). We observed that largely the same regions (~60% or more) were targeted in all the genotypes. Further, enzyme-specific (ClaI and HpaII) reads for each genotype were assembled individually (Tables S3 and S4). Largely the same (nonmethylated) genomic regions were targeted by both enzymes (Figure S3B). The results indicate that our enrichment strategy was not biased towards any specific genotype or enzyme. The difference in the fraction of bases mapped may be due to variation at the sequence level.
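
The 15 pairwise combinations follow from choosing 2 genotypes out of 6. As a small illustration (the variable names are hypothetical), the pairs can be enumerated as follows:

```python
from itertools import combinations

# The six genotypes sequenced in this study
genotypes = ["JKC703", "JKC725", "JKC737", "JKC770", "MCU5", "LRA5166"]

# Every unordered pair of distinct genotypes: C(6, 2) = 15
pairs = list(combinations(genotypes, 2))

for first, second in pairs:
    print(f"{first} vs {second}")
print(len(pairs))
```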

Repetitive sequences

Assembled sequences were further screened for repetitive elements. Although we adopted a strategy to enrich our genomic DNA sample for genic sequences, a broad fragment size range (300 bp–5 kb) was selected to capture low copy DNA, which, in turn, captured some repetitive regions too. We identified 187 118 repetitive elements (Table S5), representing 12.16% of the total data (excluding low complexity sequences). The most abundant repeats observed were retrotransposons (11.79%), the majority of which were long terminal repeat (LTR) elements, including 8.93% Gypsy-type and 2.62% Copia-type elements. Interestingly, the proportion of DNA transposons was much lower (only 0.35%) in our genic-enriched data. Finally, a total of 181 188 734 bases (14.24%), including all types of repeats, were masked.

Gene predictions and comparison with sequenced plant genomes

We used only contigs (533 271) for gene prediction and annotation. We utilized three independent de novo gene prediction tools (Table 3), that is, AUGUSTUS (Stanke et al., 2006), GENSCAN (Salamov and Solovyev, 2000) and GlimmerHMM (Majoros et al., 2004). The gene models predicted by the three tools were compared to identify 93 363 common gene models that were supported by at least two prediction tools. Sequencing of the hypomethylated restriction-based libraries generated a large number of shorter sequences carrying partial gene models, which resulted in a higher number of gene models. AUGUSTUS identified a total of 90 294 gene models, of which 2923 were unique to this tool. GENSCAN predicted the highest number of gene models (125 422), with 16 494 unique to GENSCAN. Similarly, 97 533 gene models were predicted by GlimmerHMM, including 3692 unique gene models.
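
How predictions from different tools were matched is not detailed here. The sketch below illustrates one plausible way to keep only models supported by at least two tools, under the assumption that two models "agree" when they share at least one base on the same contig. The data layout, function names and coordinates are all hypothetical, and a real comparison would probably also consider strand and overlap length.

```python
def overlaps(a, b):
    """True if two (contig, start, end) models lie on the same contig
    and share at least one base (half-open coordinates)."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def count_supporting_tools(model, predictions):
    """Number of tools with at least one model overlapping `model`.
    The tool that produced `model` counts as one supporter."""
    return sum(
        any(overlaps(model, other) for other in models)
        for models in predictions.values()
    )


def models_with_support(predictions, min_tools=2):
    """Per tool, keep the models supported by >= min_tools tools."""
    return {
        tool: [m for m in models
               if count_supporting_tools(m, predictions) >= min_tools]
        for tool, models in predictions.items()
    }


# Toy predictions (hypothetical coordinates)
toy = {
    "AUGUSTUS": [("contig1", 100, 900), ("contig2", 50, 400)],
    "GENSCAN": [("contig1", 150, 950), ("contig3", 10, 300)],
    "GlimmerHMM": [("contig1", 120, 880)],
}
supported = models_with_support(toy)
print(supported)
```

In the toy data only the contig1 models survive, because the contig2 and contig3 models are each predicted by a single tool.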

Table 3. Gene prediction and annotation details. Genes were predicted using three de novo gene prediction tools, viz. AUGUSTUS, GENSCAN and GlimmerHMM. Common gene models were annotated against the NCBI nr, TAIR10 protein and cotton EST databases, and against the TAIR GO, KEGG and CDD databases for functional annotation

| Category | Item | Count |
|---|---|---|
| Total gene models predicted (≥100 bp) | AUGUSTUS | 90294 |
| | GENSCAN | 125422 |
| | GlimmerHMM | 97533 |
| | Common genes (supported by at least two tools) | 93363 |
| Annotations | NCBI nr | 56246 |
| | TAIR10 | 47464 |
| | Cotton ESTs | 41361 |
| TAIR GO annotation | Biological process | 46093 |
| | Molecular function | 32924 |
| | Cellular components | 25555 |
| KEGG analysis | KO id assigned | 3248 |
| | Pathways identified | 87 |
| CDD | Total hits | 23819 |
| | Pfam domains | 9152 |

We compared the predicted common gene models (93 363) with genes from G. raimondii (40 976 genes; Wang et al., 2012), Arabidopsis thaliana (35 386 genes; The Arabidopsis Genome Initiative, 2000), Vitis vinifera (30 434 genes; French-Italian Public Consortium for Grapevine Genome Characterization, 2007) and Ricinus communis (31 221 genes; Chan et al., 2010). A total of 71 685 gene models had similarity with genes from these sequenced plant genomes, representing 35 923 nonredundant genes (Figure 1). Thus, 21 678 gene models were not validated against any of the genomes, although 2842 of these had matches in the NCBI nr database. Of the 40 976 genes reported in G. raimondii, 35 127 had matches among the predicted gene models. We identified 796 gene models that were not present in G. raimondii but showed similarity to genes in the other three genomes. We identified 21 399 complete gene models (full-length proteins) among the 90 294 gene models predicted by AUGUSTUS (Table 2, Data S1). A total of 16 509 complete gene models had matches in the G. raimondii genome.

Figure 1.

Venn diagram showing distribution of predicted common gene models among four sequenced plant genomes (Gossypium raimondii, Vitis vinifera, Ricinus communis and Arabidopsis thaliana). Comparison was done by using BLASTX with cut-off e-value of 1e-10, and only best hits were considered.

Annotation of predicted gene models

We annotated all the 93 363 gene models that were supported by at least two prediction tools against three different databases (Table 3), that is, the NCBI nr protein database, the TAIR10 protein database and a G. hirsutum EST database. The G. hirsutum EST database was prepared using assembled EST sequences (Xie et al., 2011) from the cotton EST database (East Carolina). Annotation of the 93 363 predicted gene models against the NCBI nr protein database (Data S2) resulted in a total of 56 246 hits, representing 36 745 nonredundant annotations. Further, 10 796 of the full-length gene models had matches within the NCBI nr database. Thus, we identified 4292 complete gene models that were reported neither in the NCBI nr database nor in the G. raimondii draft genome. Annotation against the TAIR10 protein database (Data S3) resulted in 47 464 hits, which represent 16 296 nonredundant annotations. Similarly, 41 361 hits were obtained against the cotton EST database (Data S4), representing 19 270 nonredundant annotations.

Transcription factors and promoters

We identified a total of 1093 transcription factor-encoding genes (647 full length), which were grouped in 50 transcription factor families on the basis of their similarities with the Arabidopsis transcription factors. The most abundant transcription factor families (Figure 2) were C2H2 (130, 11.89%), bHLH (100, 9.14%), AP2-EREBP (94, 8.60%), C3H (94, 8.60%) and MYB family (76, 6.95%).

Figure 2.

Distribution of predicted Gossypium hirsutum transcription factors in different TF families. The number of gene models in each TF family is indicated next to their name. C2H2 (130, 11.89%), bHLH (100, 9.14%), AP2-EREBP (94, 8.60%), C3H (94, 8.60%) and MYB family (76, 6.95%) were among the most abundant TF families.

We identified 24 839 transcription start sites (TSSs) by using AUGUSTUS gene predictions, which were further analysed to predict 3135 promoters (of at least 500 bases, Data S5A). Of these 3135 promoters, 924 were 1000 bases or longer. We further annotated the gene sequences downstream of these promoters against the NCBI nr database. A total of 2196 promoters were linked with known (annotated) genes, of which only 113 were from Gossypium species. Further, we assigned these potential cis-regulatory elements to different fibre developmental stages on the basis of the expression pattern of the linked genes, established through microarray expression profiles (GSE36228). Thus, a total of 322 promoters were categorized, including 184 initiation-specific, 28 elongation-specific and 110 secondary cell wall stage-specific promoters (Data S5B). In addition, we performed motif analysis to identify over-represented cis-regulatory elements (Figure S4). Various motifs were found to be significantly over-represented, including the CCAACC, CTCTCT, TCCCCT, GAAGAA, GAAAAG and GGTGGTGG motifs.
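
Counting motif occurrences is the first step of any over-representation analysis. The sketch below only counts occurrences of the motifs listed above on the forward strand of a sequence; it does not reproduce the statistical test used in the study, and the toy promoter sequence and function name are hypothetical.

```python
import re

# Motifs reported as over-represented in the predicted promoters
MOTIFS = ["CCAACC", "CTCTCT", "TCCCCT", "GAAGAA", "GAAAAG", "GGTGGTGG"]


def count_motif(seq, motif):
    """Count occurrences of `motif` in `seq`, allowing overlaps.
    A zero-width lookahead lets consecutive matches share bases."""
    return len(re.findall(f"(?={re.escape(motif)})", seq.upper()))


# Toy promoter sequence (hypothetical)
toy_promoter = "ttCTCTCTCTgaagaaCCAACC"
counts = {motif: count_motif(toy_promoter, motif) for motif in MOTIFS}
print(counts)
```

To judge over-representation, such counts would still need to be compared with a background expectation, for example counts in shuffled or nonpromoter sequences.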

Functional annotations

We performed gene ontology analysis (http://arabidopsis.org/tools/bulk/go/index.jsp/) for the functional annotation of the 16 296 predicted gene models that were annotated against the TAIR10 protein database (Table 3; Figure 3; and Data S6). A total of 104 572 GO terms were assigned, of which 46 093 were categorized under biological processes, 25 555 under cellular components and 32 924 under molecular functions.

Figure 3.

Functional annotation of the gene models assigned with TAIR locus IDs into different GO categories (biological process, molecular function and cellular components).

We predicted the functional association of gene models (annotated with TAIR) with different pathways using KAAS (KEGG Automatic Annotation Server). In the KEGG (Kyoto Encyclopedia of Genes and Genomes; http://www.genome.jp/kegg/) analysis, a total of 3248 KEGG orthologous ids were assigned, which were grouped into 87 different pathway models (Table 3; Data S7). In addition, putative functions of the common gene models were assigned by annotation against the NCBI CDD protein database (http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml) using RPS-BLAST (Table 3; Data S8). We obtained a total of 23 819 annotations, of which the largest share (9152 hits) was with the Pfam database. Significant numbers of annotations were also obtained with the PLN (5464), Cdd (4169), Smart (1433), TIGR (1285), COG (1052) and PRK (864) databases.

Noncoding RNAs

We identified noncoding RNAs (tRNA, rRNA and miRNA) in the total data (Table 2). A total of 1332 tRNA genes were identified on the basis of their similarity with known eukaryotic tRNA models (Table S6A). These tRNA genes include 144 pseudogenes, 15 tRNA genes with undetermined isotypes and 1 suppressor tRNA gene. Similarly, we identified 514 rRNA genes (average fragment size 249.7 bp) on the basis of similarity with Arabidopsis and Oryza rRNA sequences (Table S6B). Because miRBase (release 18; http://mirbase.org) contains only 45 Gossypium miRNA sequences, we included all the miRNA sequences reported in plants for miRNA prediction in our data. The recent draft genome of G. raimondii (Wang et al., 2012) contains a total of 31 miRNAs (29 families, total 348 sequences within the genome). Following a set of strict filtering criteria, we identified 78 miRNAs (42 families), of which 17 miRNAs (17 families) were new to Gossypium species (Table S6C; and Data S9A). We further performed miRNA target prediction and identified a significant number of targets for most of the miRNAs in our data (Data S9B).

Simple sequence repeats (SSRs)

All the 4 095 128 sequences (contigs and singletons) were searched for microsatellites using MISA software, with a criterion of a minimum of 5 repeats for each SSR type, except for mononucleotide repeats (MNRs), for which a minimum of 10 repeats was required. Thus, we identified a total of 148 930 SSRs (in 135 365 sequences), including 12 408 MNRs (Table S7; Data S10). MNRs were excluded from further analysis, and the remaining 130 223 sequences containing 136 522 SSRs were considered. Among the various repeat types, dinucleotide repeats (DNRs) were the most abundant (80.14%), followed by trinucleotide repeats (TNRs; 15.16%), under the criteria we adopted. Of the 130 223 SSR-containing sequences (136 522 SSRs), 56 142 (38%) sequences were found suitable for primer design. These include 54 626 (~97%) sequences with simple sequence repeats and 1516 (~3%) sequences with compound repeats (total 57 775 SSRs). Of the 57 775 SSRs, 47 093 (45 972 simple and 1121 compound SSRs) were novel relative to the SSR sequences already available in the Cotton Microsatellite Database (CMD, http://www.cottonmarker.org/) (Figure S5). We observed an average frequency of approximately 1 SSR per 8.58 kb (148 930 SSRs in 1.272 Gb) of the sequences analysed. Further, we mapped the 148 930 SSRs with their flanking sequences (up to 100 bp on both sides) on the G. raimondii genome, which resulted in mapping of 45 766 SSRs on the draft genome (Figure 4a). Similarly, 44 652 SSRs were mapped on the G. raimondii genome sequence reported by JGI (Figure 4b).
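
The repeat-count thresholds described above can be illustrated with a regular-expression search for perfect SSRs. This is a simplified sketch, not the MISA implementation: the 1–6 bp motif range is an assumption (the text states only the minimum repeat counts), compound SSRs are not handled, and the function name and toy sequence are hypothetical.

```python
import re

# Minimum number of repeats per motif length. The thresholds follow the
# text (10 for mononucleotides, 5 otherwise); the 1-6 bp range is assumed.
MIN_REPEATS = {1: 10, 2: 5, 3: 5, 4: 5, 5: 5, 6: 5}


def find_ssrs(seq):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in MIN_REPEATS.items():
        # Group 2 is the motif; it must be followed by min_rep - 1 more copies.
        pattern = re.compile(
            r"(([ACGT]{%d})\2{%d,})" % (unit_len, min_rep - 1)
        )
        for match in pattern.finditer(seq):
            motif = match.group(2)
            # Skip motifs that are repeats of a shorter unit, e.g. "AA" or
            # "ATAT", so each SSR is reported under its shortest motif only.
            if any(
                motif == motif[:k] * (unit_len // k)
                for k in range(1, unit_len)
                if unit_len % k == 0
            ):
                continue
            hits.append((match.start(), motif, len(match.group(1)) // unit_len))
    return sorted(hits)


# Toy sequence: a (AT)6 dinucleotide repeat and a (A)12 mononucleotide repeat
toy_seq = "GC" + "AT" * 6 + "GGC" + "A" * 12 + "CT"
toy_ssrs = find_ssrs(toy_seq)
print(toy_ssrs)
```

Start positions are 0-based. Dropping the entry for motif length 1 would mimic the exclusion of MNRs from the downstream analysis.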

Figure 4.

Distribution of Gossypium hirsutum simple sequence repeat (SSR)- and single nucleotide polymorphism (SNP)-containing sequences on the G. raimondii reference genome. SSR- and SNP-containing sequences from our study were mapped on both versions of the G. raimondii genome using gsReferenceMapper v2.5.3. (a) The draft genome. (b) The JGI version. Mapped SSRs and SNPs were graphically represented on the 13 chromosomes using Circos. The outermost coloured circle is the graphical representation of the G. raimondii chromosomes. Chromosome numbers are indicated after the chromosome (Chr) abbreviation. The green (middle) circle represents mapped SNP-containing sequences, whereas the red (inner) circle represents mapped SSR-containing sequences.

Single nucleotide polymorphisms (SNPs)

We identified a total of 422 617 SNPs in all 15 combinations of genotypes (Tables S8 and S9). After filtering out under-represented SNPs (present in <3 reads from a genotype) and nonallelic SNPs (polymorphism within reads from any single genotype; Figure S6), we obtained 75 714 SNPs in 34 032 consensus sequences (111–3261 bp). We removed redundant SNPs across all the 15 combinations and identified a total of 66 444 unique SNPs, comprising 24 612 genic SNPs (4446 in UTRs, 15 648 exonic and 4518 intronic) and 41 832 nongenic SNPs. Further, we identified 2604 synonymous and 6506 nonsynonymous SNPs (Table S8). We observed a total of 43 559 transitions (~66%) and 22 885 transversions (~34%) (Data S11). We mapped all the 66 444 SNPs with their flanking sequences (50 bp on both sides) on the G. raimondii draft genome and on the JGI genome sequence, which resulted in mapping of 31 513 and 31 246 SNPs on these versions of the genome, respectively (Figure 4a and b).
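
The transition and transversion percentages follow from a simple classification of each substitution: a change within purines (A, G) or within pyrimidines (C, T) is a transition, and any other change is a transversion. The sketch below shows this classification and recomputes the reported percentages; the function name is hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_snp(ref, alt):
    """Classify a single-base substitution as transition or transversion."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or not {ref, alt} <= PURINES | PYRIMIDINES:
        raise ValueError(f"not a valid substitution: {ref}>{alt}")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


# Counts reported in the text
transitions, transversions = 43559, 22885
total = transitions + transversions
ts_percent = 100 * transitions / total
tv_percent = 100 * transversions / total
print(f"transitions: {ts_percent:.1f}%, transversions: {tv_percent:.1f}%")
print(f"Ts/Tv ratio: {transitions / transversions:.2f}")
```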

In addition, we checked the novelty of the 66 444 unique SNPs against all the 29 234 Gossypium SNPs submitted to NCBI dbSNP (http://www.ncbi.nlm.nih.gov/snp?term=gossypium%20hirsutum) and identified 66 364 novel SNPs (80 redundant SNPs). We observed a frequency of 1 SNP per 388 bases of sequence analysed. A total of 6927 SNP-containing coding sequences were annotated against the Pfam database (Table S10). Most of the coding SNPs were enriched in protein kinase (292 SNPs), hydrolase (254 SNPs), cytochrome P450 (234 SNPs), AAA domain (216 SNPs), AP endonuclease family (166 SNPs) and NB-ARC domain (108 SNPs).

Validation of developed resources

We performed some representative experiments to validate the different genomic resources identified in this study. We checked the expression of 15 randomly selected genes from the 4292 unannotated full-length gene models (Figure S7) and 15 transcription factor-encoding genes (Figure S8) in leaf, root and fibre tissues of G. hirsutum (genotype JKC725) by real-time PCR analysis (primer details: Table S11). All of these genes were found to be expressed in one or another tissue (mostly differentially expressed), thus validating the predicted gene models and transcription factor-encoding genes. We also genotyped 40 randomly selected SSRs on 12 genotypes (Data S12) and performed MassARRAY iPLEX analysis (on Sequenom) to genotype 30 SNPs on 6 genotypes (Data S13). All the SSRs amplified across the genotypes, except in a few cases where no amplification was observed. A total of 6 SSRs (15%) were found to be polymorphic. Similarly, all the SNPs were present, validating our identification of SNPs in the G. hirsutum genome. In some cases, we observed nonallelic or homeo-SNPs, but not in the target genotypes (the genotypes in which the SNP was identified). To check the presence of predicted miRNAs in leaf, root and fibre tissues, we performed a Northern blot assay (Figure S9). Although U6 (used as a control) hybridized on the blot, confirming its presence, the two novel miRNAs tested (ghr-miR-3696 and ghr-miR-5065) did not hybridize, indicating the absence of their expression in leaf, root and fibre tissues under normal conditions. The expression of novel miRNAs may be strictly spatial-, temporal- or condition-specific, and hence, their expression could not be detected in the tissues tested.

NBRI-comprehensive cotton genomic database

To provide open public access to the various cotton resources developed at NBRI, we developed a web resource, the NBRI-Comprehensive Cotton Genomics database (NCGD), which is publicly available at http://www.ncgd.nbri.res.in/.

Discussion

We generated G. hirsutum genomic sequences enriched for nonrepetitive DNA. The advantage of using a differential methylation-based strategy is that, unlike EST sequencing, it also captures intronic and promoter sequences. In hypomethylated restriction-based strategies, the possibility of capturing nontargeted DNA elements (including repetitive and other nongenic DNA) in the libraries increases with the size of the fragments selected after restriction.

We selected six diverse G. hirsutum genotypes, four of which have superior fibre traits (JKC725, JKC770, LRA5166 and MCU5), while the other two (JKC703 and JKC737) have inferior fibre quality. We commonly use these genotypes in breeding programmes for varietal and mapping population development. The SSRs identified in this study showed good polymorphism (15%, Data S12), and the representative SNPs taken for validation were present in the target genotypes (Data S13); thus, these markers may be used immediately in marker-assisted breeding (MAB). Further, genotyping by sequencing (GBS) is an emerging trend in MAB (Elshire et al., 2011). The technique described here for complexity reduction and SNP identification could be a potential tool for GBS, as all the SNPs selected were validated by Sequenom MassARRAY (Data S13).

Transposable elements contribute a large fraction of genomes (56.35% repetitive elements in the G. raimondii genome, Wang et al., 2012). As the selection of larger fragments from methylation-sensitive restriction libraries resulted in capturing more repetitive elements, a significant fraction of repetitive sequences (12.16%; Table S5) was present in our data. Retrotransposons were the most represented (96.99% of all repeats), including LTR retrotransposons (94.2%), which are reported to be the major contributors to the repetitive fraction of the G. raimondii genome (~81% of total repetitive elements). The fraction of DNA transposons was 2.86%, in comparison with 7.97% in G. raimondii. There was an accumulation of Gypsy-like retrotransposons (73.44% of all repeat types), which is significantly higher than the share of Gypsy elements observed in the G. raimondii genome (59.41%), while the fraction of Copia-like elements (21.54%) was comparable to that of the G. raimondii genome (19.5%).
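
The percentages in this paragraph are expressed relative to the repeat fraction, whereas the Results give them relative to the total data. The sketch below converts one to the other, assuming (as an inference, not a statement from the text) that each share is the Results value divided by the 12.16% total repeat content. Because the inputs are rounded, the recomputed values differ slightly from some of the figures quoted above.

```python
# Percentages of the total data, as given in the Results
TOTAL_REPEAT_PCT = 12.16
class_pct_of_data = {
    "retrotransposons": 11.79,
    "Gypsy": 8.93,
    "Copia": 2.62,
    "DNA transposons": 0.35,
}

# Share of each class within the repeat fraction
share_of_repeats = {
    name: 100 * pct / TOTAL_REPEAT_PCT
    for name, pct in class_pct_of_data.items()
}

for name, share in share_of_repeats.items():
    print(f"{name}: {share:.2f}% of all repeats")
```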

Although the Roche 454 GS Titanium generates longer reads, the major constraint of our approach (methylation-sensitive complete restriction) was the generation of shorter assembled fragments (average contig size 900 bp; Table S2). This is reflected in the gene prediction analysis (Table 3), where multiple partial gene models were predicted for a single gene, resulting in significant inflation of the total number of predicted gene models. Nonetheless, we identified 21 399 complete gene models (full-length proteins, Data S1), of which only 10 796 were annotated with the NCBI nr database and 16 509 were present in G. raimondii (Wang et al., 2012). Here, we report 4292 complete gene models for Gossypium that were reported neither in G. raimondii nor in the NCBI database. Comparison with the G. raimondii genome resulted in mapping of almost 35% of the sequences (contigs and singletons), representing ~448 Mb of the draft genome. All the selected unannotated gene models were found to be expressed in one or another tissue (Figure S7), validating our prediction of gene models.

Comparison with genes from sequenced plants (Figure 1) indicates that almost 86% of the genes reported in the G. raimondii draft genome (Wang et al., 2012) were represented by the gene models predicted in our data. Collectively, 76.78% of the predicted gene models (93 363) were validated against the sequenced plant genomes (G. raimondii, A. thaliana, V. vinifera and R. communis). Further, we identified a total of 796 genes, validated against other plant genomes, that are not reported in G. raimondii. Of the total 533 271 contigs, gene models were distributed over only 150 152 contigs (28.16%).

Of the total 3135 predicted promoters (Data S5), although 2196 were annotated with known genes, only 113 annotations were from Gossypium genes. Fibre-specific promoters and cis-regulatory motifs are not well studied in Gossypium. A large number of genes encoding various transcription factors (total 2706) were reported in the G. raimondii genome (Wang et al., 2012), of which 1077 were present in our data. Thus, we identified a total of 16 transcription factors, including some members of the C2H2, bHLH and C3H families, that were not present in the G. raimondii genome. We also performed expression analysis of 15 of these 16 transcription factor-encoding genes new to Gossypium species and found them to be expressed in different cotton tissues. Some of these genes were specifically expressed in root or fibre tissues, indicating their significance for further studies.

In cotton, various studies have been performed to identify miRNAs and their targets (Yang et al., 2013; Yin et al., 2012), but limited information is available in the public domain for Gossypium miRNAs (only 45 Gossypium miRNAs in miRBase). Here, we identified a total of 78 miRNAs (Data S9A), including 17 that were new to Gossypium. The Squamosa promoter-binding protein-like (SPL) gene family is well established as a target of miRNA-156 (Rhoades et al., 2002; Wang et al., 2009). Here, we found that miRNA-1028 may also be involved in the regulation of SPL genes.

In the present study, we report 47 093 novel SSRs (Table S7; Figure S5) developed from six diverse G. hirsutum genotypes, which are expected to be a significant addition to the presently available repertoire of microsatellite markers. Approximately 30% of the SSRs were mapped on the G. raimondii genome (Figure 4). Recently, 11 834 SNPs were identified in nongenic regions by sequencing of genomic reduction by restriction site conservation (GR-RSC) libraries from two accessions of G. hirsutum (Byers et al., 2012). Here, we report 66 364 novel SNPs in addition to the marker resources already available in G. hirsutum (Data S11). We observed a higher frequency of transitions (~66%) than transversions (~34%), which is similar to previous reports in other plants (Deutsch et al., 2001; Garg et al., 1999). Almost 47% of the SNPs were mapped on the G. raimondii genome (Figure 4). We also genotyped 40 randomly selected SSRs on 12 G. hirsutum genotypes and 30 randomly selected SNPs on the 6 genotypes selected for sequencing, and observed that all the SSRs and SNPs were present in the cotton genome.

This study contributes a considerable amount of genomic information to the G. hirsutum domain, but further utilization of these resources will require considerable effort at every level.

Experimental procedures

Genotype selection

The diversity of genotypes was determined on the basis of their genetic distance, as observed in an AFLP analysis performed on twelve cultivated G. hirsutum genotypes (Jena et al., 2011), and the six most diverse genotypes were selected.

Genomic DNA isolation and hypomethylated restriction library preparation

Genomic DNA was isolated from leaf tissue using a modified CTAB method (Jena et al., 2004). Hypomethylated restriction libraries were prepared from genomic DNA of each of the six genotypes using the ClaI and HpaII restriction endonucleases individually. Twenty μg of genomic DNA was completely digested using HpaII and ClaI (200 U, 37 °C and 16 h) in individual reactions, and the enzyme was heat inactivated at 65 °C for 20 min. Digested DNA was precipitated using sodium acetate-ethyl alcohol and dissolved in 100 μL of MQ water. DNA fragments ranging from 300 bp to 5 kb were excised and gel-eluted.

454 sequencing and data processing

Sequencing library preparation and 454 sequencing were performed using the Roche GSFLX Titanium sequencing kit. A total of 3 μg of genomic DNA (300 bp–5 kb) was nebulized at 30 psi for 90 s and purified using the MinElute PCR Purification Kit (QIAGEN, Hilden, Germany). Fragments smaller than 300 bp were removed by Agencourt Ampure XP beads (Beckman Coulter, Brea, CA) treatment. Next, 100 ng of Ampure-treated DNA was taken for genomic library preparation. For each genotype, two individual libraries were prepared (one for ClaI fragments and the other for HpaII fragments). These libraries were sequenced on a Roche 454 Genome Sequencer (Titanium, GS sequencer v2.5, 454 Life Sciences, Branford, CT) using a full pico-titer plate. Signal processing was performed using gsRunBrowser v2.5.3.

Quality filtration and assembly

The cotton chloroplast sequence was downloaded (NC_016711, http://www.ncbi.nlm.nih.gov/nuccore/372291753), and all the reads were screened against it at a criterion of 40-bp overlap with minimum 95% identity using gsReferenceMapper v2.5.3 (454 Life Sciences, Branford, CT). Similarly, mitochondrial reads were removed using Vitis vinifera mitochondrial DNA (NC_012119, http://www.ncbi.nlm.nih.gov/nuccore/224365609) at a criterion of 40-bp overlap with minimum 85% identity. Short reads of <50 bases were also removed. All the quality-filtered reads generated from 14 sequencing runs were assembled together using gsAssembler v2.5.3 with the parameters 40-bp overlap and minimum 95% identity, and the ‘large and complex genome’ option. Additionally, genotype-wise and enzyme-wise assemblies were performed using all the filtered reads from individual genotypes and from individual enzymes, respectively.
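The read-filtering criteria can be summarized in a minimal Python sketch. It is not the gsReferenceMapper workflow itself: the function names and the (overlap, identity) tuple layout are hypothetical, and only the thresholds (40-bp overlap, 95% and 85% identity, 50-base minimum length) are taken from the text.

```python
from typing import Optional, Tuple

MIN_OVERLAP_BP = 40                 # overlap criterion stated in the text
CHLOROPLAST_MIN_IDENTITY = 95.0     # per cent, chloroplast screen
MITOCHONDRIAL_MIN_IDENTITY = 85.0   # per cent, mitochondrial screen
MIN_READ_LENGTH = 50                # reads shorter than this are removed

# Hypothetical record layout: (overlap in bp, per cent identity) of a read
# against an organellar reference.
Hit = Tuple[int, float]


def matches_reference(hit: Hit, min_identity: float) -> bool:
    """True when a hit satisfies both the overlap and the identity criterion."""
    overlap_bp, identity_pct = hit
    return overlap_bp >= MIN_OVERLAP_BP and identity_pct >= min_identity


def keep_read(read_length: int,
              chloroplast_hit: Optional[Hit] = None,
              mitochondrial_hit: Optional[Hit] = None) -> bool:
    """Return True for reads that survive length and organelle screening."""
    if read_length < MIN_READ_LENGTH:
        return False
    if chloroplast_hit is not None and matches_reference(
            chloroplast_hit, CHLOROPLAST_MIN_IDENTITY):
        return False
    if mitochondrial_hit is not None and matches_reference(
            mitochondrial_hit, MITOCHONDRIAL_MIN_IDENTITY):
        return False
    return True
```

For example, `keep_read(300, chloroplast_hit=(120, 96.0))` rejects the read, whereas the same read with a 94% identity hit would be retained.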

Identification of repetitive elements

Similarity-based identification of repetitive elements was performed on the assembled data (contigs and singletons). We used the RepeatMasker-open-3-3-0 (http://www.repeatmasker.org/RMDownload.html) programme to identify and mask the repetitive elements. A custom library was made by combining the sequences from RepBase (repeatmaskerlibraries-20110920; http://www.girinst.org/server/RepBase/index.php) with repeats already reported in plants.

Gene prediction and annotations

For gene prediction, we adopted a de novo approach using AUGUSTUS, GENSCAN and GlimmerHMM with parameters trained on A. thaliana. Common gene models were identified in masked sequences (contigs only) by similarity search using reciprocal BLAST between the predictions from all three methods. Gene models (with the longest sequences) predicted by more than one prediction method were taken to make a final gene pool. Full-length gene models were identified by the presence of a start codon as well as a stop codon in the gene model.
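The pooling rule and the full-length test can be illustrated with a short Python sketch. It assumes that the reciprocal BLAST step has already grouped equivalent predictions into one dictionary per locus (predictor name mapped to predicted coding sequence); that data layout and the function names are hypothetical, and ATG and the three standard stop codons are assumed as the start and stop signals.

```python
from typing import Dict, List

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard stop codons (assumed)


def is_full_length(cds: str) -> bool:
    """A model is full length when it begins with ATG and ends with a stop codon."""
    cds = cds.upper()
    return len(cds) >= 6 and cds.startswith("ATG") and cds[-3:] in STOP_CODONS


def select_consensus(clusters: List[Dict[str, str]]) -> List[str]:
    """Keep loci predicted by more than one method, choosing the longest model."""
    pool = []
    for predictions in clusters:
        if len(predictions) > 1:          # supported by at least two predictors
            pool.append(max(predictions.values(), key=len))
    return pool


# Toy input: one dictionary per locus, predictor -> predicted sequence.
clusters = [
    {"AUGUSTUS": "ATGAAATAA", "GENSCAN": "ATGAAACCCTAA"},
    {"GlimmerHMM": "ATGTTTTGA"},          # single predictor: dropped
    {"AUGUSTUS": "ATGCCC", "GENSCAN": "ATGC", "GlimmerHMM": "ATGCC"},
]
gene_pool = select_consensus(clusters)
full_length = [model for model in gene_pool if is_full_length(model)]
```

Here the second locus is dropped because only one predictor supports it, and the third is pooled but is not full length because it lacks a stop codon.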

The models from the final gene pool were further annotated against the NCBI nonredundant protein database (Release: 20th December 2011) and the TAIR 10.0 protein database (ftp://ftp.arabidopsis.org/home/tair/Sequences/blast_datasets/TAIR10_blastsets/TAIR10_pep_20101214) individually using BLASTX with a cut-off e-value of 1e-10 and a minimum alignment length of 50% of the query sequence (with a minimum length of 100 bp). The predicted gene models were also annotated against the G. hirsutum EST database (http://www.leonxie.com/datadownload.php; Xie et al., 2011) comprising assembled ESTs (28 432 unique sequences).
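A minimal Python sketch of the BLASTX hit filter follows. The `Hit` record and function names are hypothetical. Two interpretations are assumptions rather than statements from the text: the aligned length is measured in nucleotides on the query (from the query start and end coordinates), and the 100-bp minimum is applied to that aligned span.

```python
from typing import NamedTuple

MAX_EVALUE = 1e-10          # cut-off e-value from the text
MIN_QUERY_FRACTION = 0.5    # alignment must cover at least 50% of the query
MIN_ALIGNED_BP = 100        # assumed to apply to the aligned span on the query


class Hit(NamedTuple):
    """Hypothetical record for one BLASTX hit."""
    query_id: str
    subject_id: str
    query_length: int   # query length in nucleotides
    q_start: int        # 1-based query coordinates; reversed for minus frames
    q_end: int
    evalue: float


def aligned_span(hit: Hit) -> int:
    """Number of query nucleotides covered by the alignment."""
    return abs(hit.q_end - hit.q_start) + 1


def passes_filter(hit: Hit) -> bool:
    """Apply the e-value, coverage and minimum-length criteria together."""
    span = aligned_span(hit)
    return (hit.evalue <= MAX_EVALUE
            and span >= MIN_ALIGNED_BP
            and span >= MIN_QUERY_FRACTION * hit.query_length)
```

Under these assumptions, a 600-bp gene model aligned over query positions 1 to 300 with an e-value of 1e-40 is retained, whereas the same model aligned over positions 1 to 299 is not.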

Mapping on G. raimondii and other sequenced plant genomes

All the sequences were mapped on both versions of the G. raimondii genome (JGI version: ftp://ftp.jgi-psf.org/pub/compgen/phytozome/v9.0/Graimondii/ and the draft genome ftp://public.genomics.org.cn/BGI/cotton/Assembly/G.raimondii.chromosome.fasta.gz) using gsReferenceMapper v2.5.3. First, we mapped raw reads at six different identity criteria, that is, 95, 96, 97, 98, 99 and 100 per cent (Figure S2). Next, assembled sequences were mapped, and only fully mapped sequences (100% alignment length with 95% identity) were used for further analysis. The predicted common gene models (93 363) were compared with the protein sequences of G. raimondii, A. thaliana, V. vinifera and R. communis, downloaded from NCBI GenBank, using BLASTX with a cut-off e-value of 1e-10. Only best-hit matches were considered for the analysis.
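Best-hit selection can be sketched in a few lines of Python. The text does not state how ties were resolved, so this sketch assumes that the best hit is the one with the lowest e-value and, among equal e-values, the highest bit score. The row layout and the function name are hypothetical.

```python
from typing import Dict, Iterable, Tuple

MAX_EVALUE = 1e-10  # cut-off e-value from the text

# Hypothetical row layout: (query_id, subject_id, evalue, bitscore)
Row = Tuple[str, str, float, float]


def best_hits(rows: Iterable[Row]) -> Dict[str, str]:
    """Return the best subject per query among hits passing the e-value cut-off."""
    best: Dict[str, Tuple[str, float, float]] = {}
    for query_id, subject_id, evalue, bitscore in rows:
        if evalue > MAX_EVALUE:
            continue
        current = best.get(query_id)
        # Lower e-value wins; a higher bit score breaks ties (assumed rule).
        if current is None or (evalue, -bitscore) < (current[1], -current[2]):
            best[query_id] = (subject_id, evalue, bitscore)
    return {query: hit[0] for query, hit in best.items()}


rows = [
    ("g1", "A", 1e-50, 200.0),
    ("g1", "B", 1e-80, 310.0),
    ("g2", "C", 1e-5, 50.0),     # fails the e-value cut-off
    ("g3", "D", 1e-20, 90.0),
    ("g3", "E", 1e-20, 95.0),    # same e-value, higher bit score
]
selected = best_hits(rows)
```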

Identification of transcription factor-encoding genes

For the identification of transcription factor-encoding genes, Arabidopsis locus IDs assigned to the gene models were queried against the AGRIS transcription factor database (http://arabidopsis.med.ohio-state.edu/AtTFDB/).

Identification of promoters

Promoter-containing sequences (with a minimum of 500 bases upstream of the TSS) were identified using AUGUSTUS gene predictions. These sequences were matched with microarray expression profiles (GSE36228) from three cotton fibre developmental stages (initiation, elongation and secondary cell wall synthesis), and the corresponding sequences were identified. Motif analysis in the predicted promoters (with more than 500 bases upstream of the TSS) was performed with MEME version 4.3.0 (Bailey et al., 2009) using the ZOOPS model. A position-dependent letter matrix (PWM) was used to determine a score for any 6–10 bp sequence (P-value 1e-5), and the motifs were represented in logo format.
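Extraction of the upstream region can be sketched in Python. The function names and the coordinate convention (0-based TSS position on the forward strand of the contig) are assumptions; only the 500-base minimum comes from the text. Gene models on the minus strand are reverse-complemented so that the returned sequence always reads 5' to 3' towards the TSS.

```python
from typing import Optional

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence (A, C, G, T and N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def upstream_region(contig: str, tss: int, strand: str,
                    length: int = 500) -> Optional[str]:
    """Return `length` bases upstream of the TSS, or None if too few exist.

    `tss` is the 0-based index of the transcription start site on the
    forward strand of the contig (hypothetical convention).
    """
    if strand == "+":
        if tss < length:
            return None
        return contig[tss - length:tss]
    if strand == "-":
        end = tss + 1 + length
        if end > len(contig):
            return None
        return reverse_complement(contig[tss + 1:end])
    raise ValueError("strand must be '+' or '-'")
```

Contigs for which the function returns `None` would be excluded from the promoter set, because they do not carry the minimum upstream length.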

Functional annotations

For functional categorization, gene ontology (GO) analysis (http://arabidopsis.org/tools/bulk/go/index.jsp/) was performed using TAIR GO annotations. The GO terms associated with Arabidopsis loci (best BLASTX hit) were assigned to the corresponding gene models. For KEGG analysis (http://www.genome.jp/kegg/tool/map_pathway2.html), gene models annotated with TAIR were searched for KEGG Orthology (KO) IDs using the KEGG automated annotation server, KAAS (http://www.genome.jp/kaas-bin/kaas_main?mode=partial). The assigned KO IDs were then matched against KEGG to predict the pathways involved. Conserved domain footprints and functional sites in gene models annotated with TAIR were analysed using an RPS-BLAST search against the NCBI CDD database (http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml; Marchler-Bauer et al., 2011) with e-value 1e-5.
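The transfer of GO terms from Arabidopsis loci to gene models amounts to a lookup keyed on the locus ID of the best BLASTX hit, as in the following Python sketch. The function names and the toy input are hypothetical. TAIR protein identifiers carry an isoform suffix (for example `AT1G01010.1`), which is stripped here to obtain the locus ID.

```python
from typing import Dict, List


def locus_id(tair_protein_id: str) -> str:
    """Strip the isoform suffix: 'AT1G01010.1' -> 'AT1G01010'."""
    return tair_protein_id.split(".")[0].upper()


def transfer_go_terms(best_hit: Dict[str, str],
                      go_by_locus: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Assign to each gene model the GO terms of its best-hit Arabidopsis locus."""
    annotated = {}
    for gene_model, protein_id in best_hit.items():
        terms = go_by_locus.get(locus_id(protein_id))
        if terms:  # skip loci that are absent or carry no GO terms
            annotated[gene_model] = sorted(set(terms))
    return annotated


# Toy input for illustration only; the model names and the second and third
# locus IDs are placeholders.
best_hit = {
    "model_1": "AT1G01010.1",
    "model_2": "at2g00002.2",
    "model_3": "AT3G00003.1",
}
go_by_locus = {
    "AT1G01010": ["GO:0005634", "GO:0003700", "GO:0005634"],
    "AT2G00002": [],
}
go_annotation = transfer_go_terms(best_hit, go_by_locus)
```

The same locus-ID lookup pattern applies when Arabidopsis locus IDs are queried against a list of transcription factor loci.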

Identification of noncoding RNAs

We identified probable tRNA-encoding genes with tRNAscan-SE 1.21 (Lowe and Eddy, 1997; http://lowelab.ucsc.edu/tRNAscan-SE/) by selecting the eukaryotic tRNA model. The rRNA genes were predicted using a similarity-based approach as well as the RNAmmer 1.2 Server (www.cbs.dtu.dk/services/RNAmmer/). The similarity-based search was performed by BLASTN against Arabidopsis and Oryza rRNA sequences with e-value 1e-5.

We performed BLASTN against a set of previously identified plant miRNAs and their precursor sequences (miRBase release 18; http://mirbase.org) with e-value 1000 and a maximum of three mismatches. Precursor sequences and their thermodynamic energy were obtained using the offline tool Triplet SVM (http://bioinfo.au.tsinghua.edu.cn/software/mirnasvm/). In addition, the target sequences/sites of all the identified miRNAs were predicted in our data using a Perl script with strict parameters (Allen et al., 2005; Schwab et al., 2005), and these sequences were further compared against the NCBI nr database using the above-mentioned criteria.
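The three-mismatch criterion can be expressed as a simple comparison, sketched here in Python. This is not the BLASTN search itself: it assumes an ungapped comparison of two sequences of equal length and treats U and T as equivalent. The function names are hypothetical.

```python
MAX_MISMATCHES = 3  # maximum number of mismatches stated in the text


def normalise(seq: str) -> str:
    """Upper-case the sequence and write RNA bases in DNA notation."""
    return seq.upper().replace("U", "T")


def count_mismatches(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    a, b = normalise(a), normalise(b)
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def within_mismatch_limit(candidate: str, known: str,
                          max_mismatches: int = MAX_MISMATCHES) -> bool:
    """True when the candidate differs from the known miRNA at no more
    than `max_mismatches` positions."""
    return count_mismatches(candidate, known) <= max_mismatches
```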

Identification of simple sequence repeats (SSRs)

Simple sequence repeat (SSR) motifs were identified using MISA software (Thiel et al., 2003) with a criterion of a minimum of five repeat units for each SSR (minimum ten units for mononucleotide repeats, MNRs). Primer pairs were designed (excluding MNRs) from the flanking sequences using PRIMER3 software (Rozen and Skaletsky, 2000) in batch mode. The SSRs with up to 100-bp flanking sequences were mapped on the G. raimondii genome (both the JGI version and the draft genome) using a criterion of minimum 80% coverage with at least 95% identity, and circle diagrams were made with circos (Krzywinski et al., 2009). The novelty of these SSRs (Jena et al., 2012) was checked by comparing the complete SSR-containing sequence and the 50-bp flanking regions (upstream and downstream) with the SSR sequences already available in the Cotton Microsatellite Database (CMD, http://www.cottonmarker.org/, Blenda et al., 2006). Additionally, primer pairs from SSR-containing sequences were matched with reported sequences.
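The repeat-count criteria can be illustrated with a simplified regular-expression scan in Python. This is a sketch and not a reimplementation of MISA: the unit sizes of 1 to 6 bp are an assumption, compound SSRs are not handled, and a repeat that overlaps an already reported repeat of the same unit size is not reported. The function names are hypothetical.

```python
import re
from typing import List, Tuple

# Unit size -> minimum number of repeat units. The thresholds (10 for
# mononucleotide repeats, 5 otherwise) follow the text; sizes 1-6 are assumed.
MIN_REPEATS = {1: 10, 2: 5, 3: 5, 4: 5, 5: 5, 6: 5}


def is_primitive(motif: str) -> bool:
    """False when the motif is itself a repetition of a shorter unit (e.g. 'ATAT')."""
    return motif not in (motif + motif)[1:-1]


def find_ssrs(seq: str) -> List[Tuple[int, str, int]]:
    """Return (0-based start, motif, repeat count) for perfect SSRs in `seq`."""
    seq = seq.upper()
    found = []
    for unit, min_units in MIN_REPEATS.items():
        # The lookahead allows matches to start at every position, so a repeat
        # is not hidden by a preceding non-primitive match.
        pattern = re.compile(r"(?=(([ACGT]{%d})\2{%d,}))" % (unit, min_units - 1))
        last_end = 0
        for match in pattern.finditer(seq):
            start, end = match.start(1), match.end(1)
            motif = match.group(2)
            if not is_primitive(motif) or start < last_end:
                continue
            found.append((start, motif, (end - start) // unit))
            last_end = end
    return sorted(found)


example = "GC" + "AT" * 6 + "CG" + "A" * 12 + "CT" + "GAA" * 5 + "C" + "T" * 9
ssrs = find_ssrs(example)
```

In the example sequence, the run of nine T bases is not reported because it falls below the ten-unit minimum for MNRs.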

Identification of single nucleotide polymorphisms (SNPs)

Single nucleotide polymorphisms were identified by executing the AutoSNP 2.0 program (Barker et al., 2003) on sequences from the six diverse genotypes. A total of 15 combinations, each containing reads from two genotypes, were made, and the reads in each combination were assembled together using the parameters 40-bp overlap and minimum 97% identity to detect SNPs within these combinations. The assembled sequences with <6 reads (minimum 3 reads from each genotype) were filtered out. Reads within the remaining sequences were used as input for AutoSNP (Barker et al., 2003). The reads with SNP positions were back-traced, and polymorphism within reads of the same genotype was considered a nonallelic SNP (Figure S6). A custom script was designed to remove these nonallelic SNPs and contigs having <3 reads (depth within genotype) at the polymorphic position. Further, the SNPs that were redundant between different genotype combinations were also removed. We performed de novo gene prediction within all high-quality SNP consensus sequences using AUGUSTUS to predict SNPs in genic (synonymous and nonsynonymous) and nongenic regions of the genome. The type of polymorphism (transition/transversion) was also determined using previously reported parameters (Wakeley, 1996). The novelty of the SNPs was determined by BLAST search against SNP-containing Gossypium sequences downloaded from dbSNP (http://www.ncbi.nlm.nih.gov/snp?term=gossypium%20hirsutum). The sequences with coding SNPs were further annotated against Pfam (http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml) to determine their association with protein families and functional domains. All the SNPs with minimum 50-bp flanking sequences were mapped on the G. raimondii genome (both the JGI version and the draft genome) using a criterion of minimum 95% coverage with at least 95% identity. The circle diagrams were made with circos (Krzywinski et al., 2009).
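Three steps of this procedure lend themselves to a short Python sketch: enumerating the pairwise genotype combinations (six genotypes give 15 pairs), applying the per-genotype read-depth filter, and classifying a SNP as a transition or a transversion. The function names and the genotype labels G1 to G6 are placeholders, and the sketch does not reproduce AutoSNP or the custom script.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
MIN_READS_PER_GENOTYPE = 3  # depth criterion from the text


def genotype_pairs(genotypes: Sequence[str]) -> List[Tuple[str, str]]:
    """All unordered pairs of genotypes, assembled together two at a time."""
    return list(combinations(genotypes, 2))


def enough_depth(reads_a: int, reads_b: int,
                 minimum: int = MIN_READS_PER_GENOTYPE) -> bool:
    """True when both genotypes contribute at least `minimum` reads."""
    return reads_a >= minimum and reads_b >= minimum


def classify_snp(allele_a: str, allele_b: str) -> str:
    """Transition: purine<->purine or pyrimidine<->pyrimidine; else transversion."""
    a, b = allele_a.upper(), allele_b.upper()
    if a == b or not {a, b} <= (PURINES | PYRIMIDINES):
        raise ValueError("alleles must be two different bases from A, C, G, T")
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


# Placeholder genotype labels; the actual genotype names are not used here.
pairs = genotype_pairs(["G1", "G2", "G3", "G4", "G5", "G6"])
```

Of the twelve possible ordered base changes, four are transitions (A to G, G to A, C to T, T to C) and eight are transversions.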

Validation of developed resources

Validation of predicted gene models and transcription factor-encoding genes was done by real-time PCR-based (ABI-7500 Fast) expression analysis using SYBR chemistry. Total RNA was isolated from leaf, root and fibre tissues of the JKC725 genotype and converted to cDNA using the SuperScript II cDNA synthesis kit (Invitrogen, Carlsbad, CA). PCRs were set up in 10-μL volumes using Express SYBR GreenER 2 × master mix. The primers used are listed in Table S11. For SSR genotyping, PCR amplification of genomic DNA from 12 G. hirsutum genotypes with the designed primers (Data S10) was followed by fragment analysis on a capillary-based 3730xl DNA Analyzer (ABI), and the results were analysed using GeneMapper v4.0 (ABI, Foster City, CA). SNP genotyping (Data S13) of the 6 genotypes used in sequencing was performed on the Sequenom MassARRAY iPLEX platform using the standard manufacturer's protocol (Gabriel et al., 2009). To check the expression of miRNAs (ghr-miR-3696 and ghr-miR-5065; U6 as control) in leaf, root and fibre tissues, total RNA (10 μg, small RNA enriched) was isolated using the mirVana miRNA Isolation Kit (Ambion, Austin, TX). Northern blot assays were performed with radiolabelled (γ-P32-ATP) probes using standard protocols (Pang et al., 2009).

Data Access

The 454 sequencing reads have been submitted to NCBI SRA (SRA053686). Simple sequence repeats (SSRs) have been submitted to NCBI GenBank (Accession nos. JX576804-JX623896), and single nucleotide polymorphisms (SNPs) have been submitted to NCBI dbSNP (NCBI_ss538789672 - ss538865759).

Acknowledgements

This work was funded by the Council of Scientific and Industrial Research (CSIR), India, under NMITLI (TLP400925) and Institutional Projects (SIP05 and BSC0107). The authors would like to thank JK Agri-Genetics Limited, Hyderabad, for providing cotton genotypes; Mr. Rajiv Tripathi, NBRI, for his assistance in Northern blot analysis; and Mohammad Zubair Nizami and Mr. Vikram Srivastav, ICT Division, NBRI, for technical assistance in development of the web resource. SKS and VR thank CSIR, India, for providing fellowships.

Competing interests

The authors declare that they have no competing interests.
