Sequence analysis and assembly
Deep sequencing of the sea cucumber (Apostichopus japonicus) transcriptome on the Illumina HiSeq™ 2000 platform generated approximately 174 million (173 910 356) raw reads, of which approximately 162 million (161 923 766) high-quality reads remained after trimming. The raw reads produced by sequencing five cDNA libraries have been deposited in the NCBI SRA database (accession number: SRA050267, including SRX122152, SRX122619 and SRX122622). In addition to these high-quality reads, 1 235 031 preprocessed 454 sequencing reads and 6731 ESTs from the public database were included in the A. japonicus transcriptome analysis (Table 1).
Table 1. Summary statistics of transcriptome assembly for Apostichopus japonicus
|Type of data||Total clean reads||Total clean nucleotides (nt)||Total consensus sequences||Total length (bp)||Mean length (bp)||N50 (bp)|
|454 reads||1 235 031||624 161 079||34 937 (454Isotigs)||43 000 002||1230||1550|
|Sanger reads||6731||4 044 547||6731 (ESTs)||4 044 547||600||689|
|Illumina reads||161 923 766||14 573 138 940||94 528 (Contigs)||57 761 480||611||837|
|Assembly Results|| || ||94 704 (Unigenes)||76 736 126||810||1335|
Assembly of these reads generated 94 704 tentative consensus sequences (nonredundant sequences, or unigenes), ranging from 200 to 16 153 bp with an average size of 810 bp (Table 1). The assembled unigenes included 38 587 transcripts that were attributed to alternative splicing of 22 000 genes. Each of these unigenes contained at least two sequences with pairwise sequence similarity greater than 70%. The number of splicing isoforms in A. japonicus is not as high as in vertebrates (Wang et al. 2008). When the 454 sequencing data (Du et al. 2012) were combined with the Illumina data, 84.9% of the high-quality reads were incorporated into the unigenes, and the quality of the assembled transcriptome was much better than that of a de novo assembly using Illumina or 454 sequencing reads alone. The summary statistics of the transcriptome assembly indicate that the 454 pyrosequencing and Sanger reads facilitated effective assembly of the Illumina short reads, and that the transcripts assembled from Illumina reads covered nearly all of the 454 and Sanger sequences (Table 1). The size distribution of the unigenes is shown in Fig. 1. More than 50 000 of the unigenes had no match in the transcriptome assembly based on the 454 sequences (Du et al. 2012). Given the deeper sequencing and coverage in this study, this result suggests that a more representative collection of A. japonicus genes was obtained. Another reason is that the huge number of short reads makes transcriptome assembly difficult, as it is impeded not only by repeats but also by alternatively spliced transcripts. Allelic variation is also a significant factor affecting transcriptome assembly.
Figure 1. Length distribution of unigenes. Unigenes were generated by de novo assembly of Illumina sequencing reads and public database sequences, and were compared with the assembly of the 454 sequences.
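The mean length and N50 values reported in Table 1 can be computed from a list of sequence lengths. The following is a minimal sketch, not the authors' pipeline; the function names and the toy lengths are assumptions made for illustration, and N50 is taken here in its usual sense (the length at which the longest sequences together reach half of the total assembled bases).

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are sorted from longest to shortest; the N50 is the length
    of the sequence at which the running total first reaches half of the
    total number of assembled bases.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def mean_length(lengths):
    """Return the arithmetic mean of the sequence lengths."""
    return sum(lengths) / len(lengths)


# Toy lengths (hypothetical, not taken from the assembly).
toy_lengths = [200, 300, 500, 800, 1200, 2000]
print(n50(toy_lengths), round(mean_length(toy_lengths)))

# Consistency check on Table 1: total length / number of unigenes.
print(round(76_736_126 / 94_704))
```

Because a few long sequences contribute many bases, the N50 is typically larger than the mean, as is the case for every row of Table 1.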
Unigene sequences were first annotated by blastx against the protein databases nr, Swiss-Prot, KEGG and COG (e-value < 1e-5), and then by blastn against the nucleotide database nt (e-value < 1e-5). Of the 94 704 assembled sequences, 29 357 showed significant matches to nr, 24 573 to Swiss-Prot, 21 139 to KEGG, 10 124 to COG and 11 003 to nt. Altogether, 32 479 unigenes had a significant match in at least one of these databases.
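The counting step described above (unigenes with a hit below the e-value threshold in each database, and in at least one database) can be sketched as follows. This assumes BLAST tabular output (`-outfmt 6`), in which the first column is the query ID and the eleventh is the e-value; the text does not state which output format was used, and the function names and toy records are hypothetical.

```python
E_VALUE_CUTOFF = 1e-5  # threshold stated in the text


def queries_with_hits(tabular_lines, cutoff=E_VALUE_CUTOFF):
    """Return the set of query IDs with at least one hit below the cutoff.

    Assumes tab-separated BLAST output where column 1 is the query ID
    and column 11 is the e-value (an assumption, see lead-in).
    """
    hits = set()
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if float(fields[10]) < cutoff:
            hits.add(fields[0])
    return hits


def annotated_union(per_database_hits):
    """Return the queries that have a hit in at least one database."""
    annotated = set()
    for hit_set in per_database_hits.values():
        annotated |= hit_set
    return annotated


# Toy records (hypothetical IDs and values).
nr_lines = [
    "Unigene1\tgi|1\t85.0\t200\t30\t0\t1\t600\t1\t200\t1e-50\t180",
    "Unigene2\tgi|2\t40.0\t50\t30\t0\t1\t150\t1\t50\t0.002\t35",
]
swissprot_lines = [
    "Unigene3\tsp|P1\t70.0\t120\t36\t0\t1\t360\t1\t120\t3e-30\t110",
]
hits = {
    "nr": queries_with_hits(nr_lines),
    "Swiss-Prot": queries_with_hits(swissprot_lines),
}
print(sorted(annotated_union(hits)))

# Proportion of unigenes with at least one hit, from the counts in the text.
print(round(100 * 32_479 / 94_704, 1))
```

Taking the union rather than summing the per-database counts matters because the same unigene usually matches several databases.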
Among the 94 704 unigenes, 36 005 (38.2%) were predicted to contain CDSs (>100 bp), ranging from 102 bp to 14 625 bp with an average length of 721 bp (Fig. 2). The transcripts without predicted CDSs were mostly short and had low coverage. For instance, 49 817 of these transcripts (52.6% of all unigenes) are shorter than 500 bp, and only 4.0% of all reads mapped to them. These short, low-coverage transcripts may represent chimeras resulting from assembly errors, fragments of transcripts from genes expressed at low levels, or noncoding RNA.
Approximately 34.3% of the unique consensus sequences were successfully annotated in this study. This proportion is comparable to those reported in other de novo transcriptome sequencing studies of nonmodel organisms (Clark et al. 2010; Franchini et al. 2011; Hou et al. 2011; Du et al. 2012; Qin et al. 2012). Among the annotated unigenes, 13 706 (42.2%) were matched to Strongylocentrotus purpuratus, 4255 (13.1%) to Saccoglossus kowalevskii, 2696 (8.3%) to Branchiostoma floridae, 877 (2.7%) to Nematostella vectensis, 812 (2.5%) to Danio rerio, 10 133 (31.2%) to Oreochromis niloticus and a small number to other species.
Gene Ontology (GO) annotation was further performed for the annotated unigenes in terms of biological process, molecular function and cellular component. In total, 14 597 unigenes were assigned at least one GO term, for a total of 116 226 GO assignments. The distribution of the unigenes among the three GO categories is shown in Fig. S1 (Supporting Information). For biological process, cellular process was the most abundant GO term (17.2%), followed by metabolic process (12.9%) and biological regulation (8.7%). For molecular function, binding (47.6%) and catalytic activity (35.7%) were the most highly represented terms. For cellular component, the major categories were cell (22.2%), cell part (22.1%) and organelle (16.2%).
GO term assignment for molecular function revealed that a total of 2009 unigenes were annotated with the category protein binding (GO:0005515), which contained the most genes, including some important immune-related and stress-related genes, such as thioredoxin, interleukin enhancer binding factor 3, scavenger receptor cysteine-rich protein type 12 precursor, Hsc70-interacting protein, heat shock protein 26 and melanotransferrin. This result may reflect the fact that several of the cDNA libraries used for Illumina sequencing were derived from samples collected after LPS challenge.
KEGG pathway analysis showed that 21 139 unigenes were mapped to 255 pathways, of which metabolic pathways were the most abundant (7402). Several pathways, such as complement and coagulation cascades, chemokine signalling, cytokine–cytokine receptor interaction and Toll-like receptor signalling pathways, are clearly linked with immune responses. These results will provide a basis for future studies, such as identification of gene-associated markers, gene cloning and expression analysis.
As basal deuterostomes, echinoderms occupy the top taxonomic position among invertebrates and form an evolutionary bridge between invertebrates and vertebrates. In addition to the sea urchin S. purpuratus, whose genome has been sequenced and published, the availability of transcriptomes and genomes from other echinoderm classes is expected to promote phylogenetic and comparative evolutionary genomics, and to enable characterization of the functional gene repertoire of different echinoderm species.
In this study, using TreeFam and the pipeline described in the Methods, we obtained 23 145 gene families from A. japonicus and the six reference genomes. The common and unique gene families among A. japonicus, Ciona intestinalis, B. floridae and S. purpuratus are summarized in Fig. 3. To reconstruct the evolutionary relationship between A. japonicus and other animals, a set of 127 single-copy gene families was obtained by the TreeFam method and concatenated into a super peptide (84 261 peptide sites), which was used to construct a phylogenetic tree with PhyML (Fig. 4). The results showed that, except for Crassostrea gigas, the other six species formed three clades of two species each, and A. japonicus was most closely related to S. purpuratus. Both S. purpuratus and A. japonicus are echinoderms, and our phylogenetic analysis reaffirmed their taxonomic closeness. The phylogenetic analysis also revealed the evolutionary relationships between the representative marine species and the vertebrates. Molecular evidence is considered to be more effective for taxonomic classification than other existing methods. The use of transcriptome data and mitochondrial genomes for phylogeny and orthology analysis has become increasingly popular (Castoe et al. 2009; Meusemann et al. 2010; Kocot et al. 2011). Especially in certain taxa, such as arthropods, echinoderms, tetrapods and snakes, which have undergone radical morphological and physiological transformations (Carroll 1997), molecular data can make robust contributions to phylogenetic analysis.
Figure 3. Venn diagram of unique and common genes among Apostichopus japonicus, Branchiostoma floridae, Strongylocentrotus purpuratus and Ciona intestinalis.
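The concatenation of single-copy gene families into a super peptide, as described above for the 127 families and 84 261 peptide sites, can be illustrated with a short sketch. It is not the TreeFam pipeline itself: it assumes each family has already been aligned, and the data structure, function name, species labels and toy sequences are hypothetical.

```python
def build_supermatrix(families, species):
    """Concatenate aligned single-copy families into one sequence per species.

    families: dict mapping family ID -> dict of species -> list of aligned
    sequences. A family is used only if every species has exactly one
    member (single copy); all members of a family must have equal length.
    Returns (dict of species -> concatenated sequence, list of families used).
    """
    parts = {sp: [] for sp in species}
    used = []
    for family_id in sorted(families):
        members = families[family_id]
        if any(len(members.get(sp, [])) != 1 for sp in species):
            continue  # missing or multi-copy in at least one species
        widths = {len(members[sp][0]) for sp in species}
        if len(widths) != 1:
            raise ValueError(f"family {family_id} is not aligned to equal length")
        for sp in species:
            parts[sp].append(members[sp][0])
        used.append(family_id)
    return {sp: "".join(seqs) for sp, seqs in parts.items()}, used


# Toy input (hypothetical): F2 is skipped because A_jap has two copies.
species = ["A_jap", "S_pur", "C_gig"]
families = {
    "F1": {"A_jap": ["MK-L"], "S_pur": ["MKAL"], "C_gig": ["MRAL"]},
    "F2": {"A_jap": ["GG", "GA"], "S_pur": ["GG"], "C_gig": ["GG"]},
    "F3": {"A_jap": ["TW"], "S_pur": ["TW"], "C_gig": ["SW"]},
}
supermatrix, used = build_supermatrix(families, species)
print(used, supermatrix["A_jap"])
```

The resulting equal-length sequences, one per species, are the kind of input a maximum-likelihood program such as PhyML expects after conversion to its alignment format.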
Identification of gene-associated markers
Owing to their potential for high genotyping efficiency, automation, data quality, genome-wide coverage and analytical simplicity (Morin et al. 2004), SNPs have rapidly become the marker of choice for many applications in genetics and genomics (Liu et al. 2011). In this study, we identified 142 511 high-quality SNPs from the 94 704 unigenes (Table 2). The putative SNPs comprised 82 664 transitions and 59 847 transversions. The minor allele frequencies of the SNPs, including transitions and transversions, were estimated from the sequence data (Fig. 5). The overall frequency of all types of SNPs in the transcriptome was one per 538 bp. The distribution of filtered SNPs per contig is shown in Fig. 6. Of all the putative SNPs, 137 616 (91.0%) were identified from contigs composed of more than ten reads, suggesting that the majority of SNPs identified in this study were covered at sufficient sequencing depth and are more likely to represent ‘true’ SNPs. Among these SNPs, 89 375 (59.1%) were identified from contigs with annotation information, and they were distributed among 15 473 known genes (Table 2).
Figure 5. Distribution of minor allele frequencies of single nucleotide polymorphisms (SNPs) identified for Apostichopus japonicus. The X-axis represents the sequence-derived minor allele frequency (%), and the Y-axis represents the number of SNPs with a given minor allele frequency.
Figure 6. Distribution of filtered single nucleotide polymorphisms (SNPs) per contig. Histograms depict frequency of contigs with a given number of SNPs identified.
Table 2. Summary of single nucleotide polymorphisms (SNPs) identified from the Apostichopus japonicus transcriptome
|Total SNPs||142 511|
|Number of contigs with SNPs||26 186|
|Number of known genes containing SNPs||15 473|
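The SNP summary quantities discussed in this section (transition or transversion class, minor allele frequency and overall SNP density) follow from simple definitions. The sketch below is illustrative only: the function names are hypothetical, the minor allele frequency is computed for a biallelic site from read counts, and the totals are the figures reported in the text and in Table 1.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_snp(allele_1, allele_2):
    """Classify a biallelic SNP as a transition or a transversion.

    Transitions exchange two purines (A/G) or two pyrimidines (C/T);
    every other substitution is a transversion.
    """
    pair = {allele_1.upper(), allele_2.upper()}
    if len(pair) != 2 or not pair <= (PURINES | PYRIMIDINES):
        raise ValueError("expected two different bases from A, C, G, T")
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


def minor_allele_frequency(count_1, count_2):
    """Return the minor allele frequency of a biallelic site from read counts."""
    total = count_1 + count_2
    if total == 0:
        raise ValueError("no reads cover this site")
    return min(count_1, count_2) / total


# Figures reported in the text and in Table 1.
TRANSITIONS = 82_664
TRANSVERSIONS = 59_847
TOTAL_UNIGENE_BP = 76_736_126

total_snps = TRANSITIONS + TRANSVERSIONS
print(total_snps)                                # total putative SNPs
print(round(TRANSITIONS / TRANSVERSIONS, 2))     # transition/transversion ratio
print(round(TOTAL_UNIGENE_BP / total_snps))      # bp per SNP

print(classify_snp("A", "G"), classify_snp("C", "A"))
print(minor_allele_frequency(18, 2))
```

The density check reproduces the reported figure of one SNP per 538 bp when the total unigene length from Table 1 is used as the denominator.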
Further analysis showed that 33 775, 63 120 and 45 616 SNPs were located in sequences without predicted CDSs (non-CDSs), coding sequences (CDSs) and untranslated regions (UTRs), respectively. SNPs that occur in protein-coding regions are useful for assessing polymorphisms that directly affect phenotypes related to economically important traits. On the other hand, the high proportion of coding SNPs in disease- or stress-associated functional genes may be the result of natural selection. Thus, transcript regions containing sequence variations can be used to explain the influence of natural selection at the gene and protein levels (Ellegren 2008). In some cases, mutations in coding regions may cause loss of protein function, leading to species extinction. In contrast to such destructive mutations, beneficial mutations can be inferred from traits retained during evolution (Lynch et al. 2006; Zhu et al. 2012).
Recently, comparative genomic analyses have been conducted on noncoding regions, where the level of sequence conservation is similar to that of protein-coding genes (Bejerano et al. 2004; Dermitzakis et al. 2004). Nevertheless, which regions harbour the mutations that ultimately determine molecular function and organismal fitness needs further discussion (Kryukov et al. 2005).
For the DEGs, GO term assignment and KEGG pathway analyses showed that important genes and signalling pathways associated with growth, metabolism, disease, immunity and stress responses were present in the transcriptome data. Insights into the SNPs in these genes, including Cu/Zn superoxide dismutase (14 SNPs), heat shock protein 70 (6 SNPs) and cytochrome P450 (15 SNPs), help to explain individual resistance to hypoxia or oxidative stress in the marine environment. It is noteworthy that the mannan receptor possesses 89 SNPs, the highest polymorphism among all the unigenes. Previous studies demonstrated that mannan receptors are involved in many biological processes, especially in the clearance of extracellular pathogens and peroxidases to maintain homoeostasis (Vigerust et al. 2012).
Microsatellite markers (SSR markers) are among the most successful molecular markers for constructing sea cucumber genetic maps and for diversity analysis. In this study, a total of 6417 microsatellites were detected in 5970 unigenes, 3216 of which were annotated. These microsatellites included 2969 (46.3%) dinucleotide motifs, 2924 (45.6%) trinucleotide motifs, 291 (4.5%) tetranucleotide motifs, 170 (2.6%) pentanucleotide motifs and 63 (1.0%) hexanucleotide motifs (Table 2). Among these motifs, the most abundant was (AT/AT), followed by (AC/GT), (AG/CT), (AAT/ATT), (AAG/CTT), (AGG/CCT), (ATC/GAT) and (AAC/GTT) (Fig. 7). For 2481 of these 6417 SSRs, at least one primer pair was successfully designed using Primer3 v2.23 (Table 3). There were 2367, 2316 and 1734 SSRs identified in non-CDSs, CDSs and UTRs, respectively.
Figure 7. Frequency of classified repeat types (considering sequence complementarity). Histograms depict the frequency of different SSR repeats.
Table 3. Summary of simple sequence repeats (SSRs) identified from the Apostichopus japonicus transcriptome
|Total number of sequences examined||94 704|
|Total number of identified SSRs||6417|
|Number of unigenes containing SSRs||5970|
|Number of unigenes containing SSRs with sufficient flanking sequence||2481|
|Number of known genes containing SSRs||3216|
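Microsatellite detection and the grouping of complementary motifs, such as (AC/GT), can be sketched with regular expressions. This is an illustration rather than the tool used in the study: the minimum repeat numbers in `MIN_REPEATS` are hypothetical (this section does not state the thresholds), only perfect repeats are found, and the function names are assumptions.

```python
import re
from collections import Counter

# Hypothetical minimum repeat counts per motif length (not from the paper).
MIN_REPEATS = {2: 6, 3: 5, 4: 5, 5: 5, 6: 5}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_motif(motif):
    """Return the alphabetically smallest rotation of a motif or of its
    reverse complement, so that AC, CA, GT and TG all map to AC."""
    reverse_complement = motif.translate(_COMPLEMENT)[::-1]
    rotations = [
        m[i:] + m[:i]
        for m in (motif, reverse_complement)
        for i in range(len(m))
    ]
    return min(rotations)


def is_primitive(motif):
    """Return True if the motif is not a repeat of a shorter unit (e.g. ATAT)."""
    n = len(motif)
    return not any(
        n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n)
    )


def find_ssrs(sequence):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    sequence = sequence.upper()
    hits = []
    for size, min_repeats in MIN_REPEATS.items():
        # A motif of `size` bases followed by at least min_repeats - 1 copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_repeats - 1))
        for match in pattern.finditer(sequence):
            motif = match.group(1)
            if is_primitive(motif):
                hits.append((match.start(), motif, len(match.group(0)) // size))
    return sorted(hits)


# Toy sequence (hypothetical): (AT)8 and (AAG)5 separated by spacers.
toy_sequence = "GGG" + "AT" * 8 + "CCC" + "AAG" * 5 + "T"
ssrs = find_ssrs(toy_sequence)
print(ssrs)
print(Counter(canonical_motif(motif) for _, motif, _ in ssrs))
```

The primitive-motif check prevents a single dinucleotide run from also being reported as a tetranucleotide or hexanucleotide repeat, and excludes mononucleotide runs.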
The effects of SSR and SNP markers on estimates of genetic diversity have been compared in detail previously (Russell et al. 2004). Within or between wild populations, these two marker types showed uncorrelated patterns of diversity and divergence, which might be explained by differences in the intrinsic properties of each marker (Defaveri et al. 2013). Our results showed that 3216 (50.1%) SSRs and 89 375 (59.1%) SNPs were detected in annotated contigs; these would be priority candidates for marker development and useful for further molecular ecology, evolution, genetic or genomic studies in this species.
Gene-associated markers validation
A total of 20 putative SSRs and 23 SNPs were tested by PCR amplification in 32 individuals. Of the 20 SSR primer pairs, 16 amplified the expected products and 13 showed polymorphism (Table S1, Supporting Information). The remaining four primer pairs failed to amplify any PCR product, a result also observed during the development of EST-SSRs in the same species (Peng et al. 2009). The nonamplification was probably due to primer sequences spanning introns and/or containing mutations or indels.
Of the 15 SNP primer pairs designed for the 23 SNPs, three failed in PCR amplification and two were monomorphic, suggesting that the corresponding SNPs might not be true SNPs, that their minor alleles were too rare to be detected, or that the primers did not work. Collectively, 18 SNP loci amplified by 10 primer pairs were polymorphic in the tested population (Table S2, Supporting Information).
It is difficult to compare the final putative SNPs with the results obtained by Du et al. because sequence alignments revealed many mismatches and gaps between the final assembled transcripts obtained here and the assembly of the 454 sequences. This is probably caused by inherent differences between the samples, differences between the two sequencing platforms and the deeper sequencing conducted in this study. SNP validation showed that 18 of 23 (78.3%) selected putative SNPs gave the expected results by PCR amplification and sequencing, suggesting that the majority of the putative SNPs are true. The larger number of SNPs obtained in the present study compared with the 454 pyrosequencing results of Du et al. is probably due to the larger quantity of sequencing data and the more extensive coverage, which allowed a large number of low-abundance SNPs to be found. We integrated the 454 and Sanger sequencing data into our Illumina data assembly. The final putative SNPs contained the majority of the SNPs from the different data sets.
Applications of SNPs have developed rapidly owing to their broad genome coverage and widespread occurrence. The SNPs we obtained are expected to help address questions in ecology and evolution, such as genetic differentiation, natural selection and speciation (Geraldes et al. 2013). Given the great potential of such markers in ecological and evolutionary analyses (Garvin et al. 2010), further effort should focus on identifying species-specific or new types of diagnostic SNPs in sea cucumbers using the data from this study.
Differentially expressed genes after LPS challenge
During the past decade, skin ulceration diseases caused by Gram-negative bacteria have posed the most serious threat to cultivated A. japonicus (Eeckhaut et al. 2004). Identification and characterization of immune-related genes will help us to understand the mechanism of immune responses to bacteria in sea cucumbers. Additionally, because echinoderms comprise the sister group of chordates and occupy a critical and largely unexplored phylogenetic position, studies of immune responses in A. japonicus are crucial for understanding the evolution of the immune system in metazoans.
Some immune-related genes in sea cucumbers have been characterized, and their expression patterns after LPS challenge have been analysed (Santiago et al. 2000; Santiago-Cardona et al. 2003; Ramírez-Gómez et al. 2008; Yang et al. 2009, 2010; Zhou et al. 2011). So far, large-scale identification of immune-related genes at the genome or transcriptome levels in sea cucumber has not been performed. In echinoderms, the main immune effector cells are the coelomocytes. Consequently, the transcripts of A. japonicus coelomocytes were quantified, and DEGs were analysed after LPS challenge in the present study.
Transcriptome comparison revealed 1330, 1347 and 1291 DEGs in the coelomocytes of A. japonicus at 4 h, 24 h and 72 h after LPS challenge, respectively (Table 4). Of these DEGs, 642, 890 and 837 were upregulated, while 688, 457 and 454 were downregulated at 4 h, 24 h and 72 h, respectively. At 24 h and 72 h after LPS challenge, upregulated genes outnumbered downregulated genes by nearly twofold, whereas at 4 h the numbers of upregulated and downregulated genes were similar. Some genes were consistently upregulated or downregulated at different time points after challenge (Fig. 8).
Table 4. Differentially expressed genes in the coelomocytes of Apostichopus japonicus after lipopolysaccharides (LPS) challenge
| ||4 h||24 h||72 h|
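The classification of DEGs into upregulated and downregulated sets, and the identification of genes that change consistently across time points, can be sketched as follows. The thresholds `LOG2FC_CUTOFF` and `FDR_CUTOFF`, the function names and the toy records are hypothetical (this section does not state the criteria used); the final lines recompute the totals and the ratio of upregulated to downregulated genes from the counts given in the text.

```python
LOG2FC_CUTOFF = 1.0   # hypothetical: at least a twofold change
FDR_CUTOFF = 0.001    # hypothetical significance threshold


def call_degs(records, log2fc_cutoff=LOG2FC_CUTOFF, fdr_cutoff=FDR_CUTOFF):
    """Split genes into upregulated and downregulated sets.

    records: iterable of (gene_id, log2_fold_change, fdr) tuples comparing
    a challenged sample with its control.
    """
    up, down = set(), set()
    for gene_id, log2fc, fdr in records:
        if fdr > fdr_cutoff or abs(log2fc) < log2fc_cutoff:
            continue  # not significant or change too small
        (up if log2fc > 0 else down).add(gene_id)
    return up, down


def shared_by_all(gene_sets):
    """Return the genes present in every set (empty if no sets are given)."""
    gene_sets = list(gene_sets)
    return set.intersection(*gene_sets) if gene_sets else set()


# Toy records (hypothetical genes and values).
toy = {
    "4 h": [("g1", 2.1, 1e-6), ("g2", -1.5, 1e-4), ("g3", 0.4, 1e-8)],
    "24 h": [("g1", 3.0, 1e-9), ("g2", 1.2, 1e-5), ("g4", -2.2, 0.05)],
    "72 h": [("g1", 1.1, 5e-4), ("g2", -1.0, 1e-4)],
}
calls = {time: call_degs(records) for time, records in toy.items()}
always_up = shared_by_all(up for up, _ in calls.values())
print(always_up)

# Counts of (upregulated, downregulated) DEGs reported in the text.
reported = {"4 h": (642, 688), "24 h": (890, 457), "72 h": (837, 454)}
for time, (n_up, n_down) in reported.items():
    print(time, n_up + n_down, round(n_up / n_down, 2))
```

Set intersection across time points corresponds to the genes described as changing consistently after challenge.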
Approximately 58.4% (1802) of the DEGs in coelomocytes of A. japonicus were annotated. Gene Ontology annotation showed that 952 DEGs could be assigned at least one GO term. The number of DEGs with at least one GO term at the three tested time points after challenge is shown in Fig. S2 (Supporting Information). KEGG pathway analysis showed that 1058 DEGs were mapped to 238 pathways, of which metabolic pathways were the most abundant (190), followed by the focal adhesion (98), ECM–receptor interaction (71) and phagosome (61) pathways. The DEGs included some important immune-related genes, such as C-type lectin, lysozyme, interleukin 17C1 precursor, complement factors C3, Bf and H, Toll-like receptors, thioredoxin, tumour necrosis factor receptor-associated factor and lysosomal-associated transmembrane protein. The expression patterns of some DEGs corresponded to data we obtained previously by qPCR (Yang et al. 2010; Zhou et al. 2011), and more should be validated in future. These data provide important information on the role of coelomocytes in immune responses.
In conclusion, we performed large-scale transcriptome sequencing of the sea cucumber A. japonicus using an Illumina sequencing platform. The quality of the assembled A. japonicus transcriptome was greatly improved by integrating the Illumina data with the public data, permitting gene discovery and characterization across a broad range of functional categories. A large number of potential genetic markers were identified from the A. japonicus transcriptome. The SNPs and SSRs identified here will provide a sufficient resource for genetics and molecular ecology studies in sea cucumbers. The DEGs in response to LPS were also identified. Such data will facilitate the discovery of immune-related genes and functional genomic studies of the sea cucumber.