HpBase: A genome database of a sea urchin, Hemicentrotus pulcherrimus
Abstract
To understand the mystery of life, it is important to accumulate genomic information for various organisms because the whole genome encodes the commands for all the genes. Since the genome of Strongylocentrotus purpuratus was sequenced in 2006 as the first sequenced genome in echinoderms, the genomic resources of other North American sea urchins have gradually been accumulated, but no sea urchin genomes are available in other areas, where many scientists have used the local species and reported important results. In this manuscript, we report a draft genome of the sea urchin Hemicentrotus pulcherrimus because this species has a long history as the target of developmental and cell biology in East Asia. The genome of H. pulcherrimus was assembled into 16,251 scaffold sequences with an N50 length of 143 kbp, and approximately 25,000 genes were identified in the genome. The size of the genome and the sequencing coverage were estimated to be approximately 800 Mbp and 100×, respectively. To provide these data and the annotation information, we constructed a database, HpBase (http://cell-innovation.nig.ac.jp/Hpul/). In HpBase, gene searches, genome browsing, and BLAST searches are available. In addition, HpBase includes the “recipes” for experiments from each lab using H. pulcherrimus. These recipes will continue to be updated according to the circumstances of individual scientists and can be powerful tools for experimental biologists and for the community. HpBase is a suitable dataset for evolutionary, developmental, and cell biologists to compare H. pulcherrimus genomic information with that of other species and to isolate gene information.
1 INTRODUCTION
Phylogenetically, echinoderms share a common ancestor with chordates in the deuterostome clade. This gives us the opportunity to consider echinoderms as model organisms in discussing how deuterostomes have evolved from the common ancestors of bilaterians. In particular, because the centralization of the nervous system in deuterostomes is one of the largest questions in biology, neurogenesis and neural functions in echinoderms are good targets for evolutionary studies (Garstang, 1922, 1984). In addition, a number of scientists using echinoderms with experimental‐biology techniques have answered fundamental questions in biology (Davidson, 2010; Evans, Rosenthal, Youngblom, Distel, & Hunt, 1983; Levine & Davidson, 2005). However, although four sea urchin species have mainly been used as model organisms in echinoderm experimental biology throughout the world, the genomes of only the North American species, Strongylocentrotus purpuratus and Lytechinus variegatus, have been reported (Sergiev, Artemov, Prokhortchouk, Dontsova, & Berezkin, 2016; Sodergren et al., 2006). Therefore, the genomic information of the other two species, Hemicentrotus pulcherrimus in Asia and Paracentrotus lividus in Europe, is awaited because the information is now essential for evolutionary and experimental biology.
In Japan, H. pulcherrimus has a long history as a model organism in the field of biology, and reports on H. pulcherrimus have increased since the mid‐1900s. The number of scientists using H. pulcherrimus is now the second largest in the sea urchin experimental biology field, and a number of excellent studies using this species have been published. For example, papers about the mechanisms of body axis formation (Kominami, 1988), neurogenesis (Yaguchi & Katow, 2003), and cilia/flagella motility (Yano & Miki‐Noumura, 1981) have provided many important biological insights. In addition, genome editing was introduced into H. pulcherrimus first among echinoderms (Ochiai et al., 2010, 2012). These facts indicate that providing the fully accessible genomic information of this species will advance our understanding of many biological questions.
Here, we report the genomic information for a species of non‐American sea urchin, H. pulcherrimus. We also build a publicly available database for the H. pulcherrimus genome, HpBase, and discuss the possibility of H. pulcherrimus as a model organism for modern experimental, evolutionary, cell, and developmental biology.
2 MATERIALS AND METHODS
2.1 Genome materials and sequencing
Genomic DNA of H. pulcherrimus isolated from the sperm of a single male collected in Shimoda, Shizuoka, Japan, was used for constructing the genomic libraries: a paired‐end library with inserts of approximately 300 bp and two mate‐pair libraries with inserts of 3 kbp or 10 kbp. These genomic libraries were sequenced with Illumina HiSeq 2000. The library construction and sequencing were conducted by Takara Bio Inc.
2.2 Transcriptome materials and sequencing
For transcriptome sequencing, we used a mixture of total RNA extracted from embryos from a single pair of a male and a female at several developmental stages (2, 14, 21, and 43 hr after fertilization). The paired‐end RNA‐seq library was sequenced with a read length of 90 bp using Illumina HiSeq 2000. The library construction and sequencing were performed by Beijing Genomics Institute (BGI).
2.3 Genome and transcriptome assemblies
De novo genome assembly for H. pulcherrimus was performed with ALLPATHS‐LG (Gnerre et al., 2011) with the parameter “HAPLOIDIFY=TRUE” by Takara Bio Inc. Raw genome sequence reads with a quality score of less than 10 were removed from the analysis. Additional sequences of 10,307 BAC ends (accession IDs: AG981701–AG992007) were used to improve the assembly quality using SSPACE (Boetzer, Henkel, Jansen, Butler, & Pirovano, 2011). Gaps in the scaffold sequences were filled using SOAPdenovo version 1.12 GapCloser (Luo et al., 2012) implemented on the NGS data analysis platform, Maser (see below). The transcriptome reads were assembled using Trinity r2013‐02‐25 (Grabherr et al., 2011; Haas et al., 2013) after adapter trimming.
2.4 Gene predictions and annotation
We used MAKER2 (Holt & Yandell, 2011) to predict the gene structure from the genomic sequences. Briefly, MAKER2 first predicts gene models using two gene prediction programs and aligns these primary gene models to known sequences by BLAST (ncbi‐blast‐2.2.28+) searches, producing the integrated gene models. We collected the reference sequences of related species for BLAST searches from the NCBI nucleotide and protein databases. Finally, MAKER2 generated the nucleotide and protein sequences of the H. pulcherrimus gene models and a gene structure file.
We also predicted protein‐coding genes based on the transcriptome assembly using TransDecoder 2.0.1 (Haas et al., 2013). The resultant ORFs were clustered using the CD‐HIT program version 4.6.1 (Li & Godzik, 2006) under a similarity threshold of 100% to remove redundant or highly redundant sequences.
To infer gene function and orthology, we conducted reciprocal (bidirectional) BLASTP (NCBI‐blast‐2.2.30+) searches between H. pulcherrimus and S. purpuratus gene models (SPU_nucleotide.fasta deposited in Echinobase at http://www.echinobase.org). We also searched H. pulcherrimus gene models against the UniProt database and NCBI nucleotide collections to help annotate the gene models. From these BLAST search results, the top hit subject (E‐value ≤ 1e‐10) was retrieved.
2.5 Genome size estimation
The Illumina sequencing reads of the H. pulcherrimus genome were used for genome size estimation with a genome character estimator, GCE v1.0.0 (Liu et al., 2013), that takes into account the repeat contents and heterozygosity from the kmer distribution to estimate a more accurate genome size. We first calculated the distribution of the 17‐mer frequency for the Illumina raw reads using Jellyfish v1.1.11 (Marçais & Kingsford, 2011) and then ran GCE using the resultant 17‐mer frequency with optional settings of ‐c 100 and ‐H 1 (for adjusting non‐unique peaks and a heterozygous‐caused peak, respectively).
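The idea behind k‐mer based size estimation can be illustrated with a deliberately simplified sketch. It assumes a two‐column "depth count" histogram (the layout written by `jellyfish histo`) and divides the total number of k‐mers by the depth of the main peak. This is not the GCE model, which additionally corrects for repeats and heterozygosity; the function names and the `min_depth` error cut‐off are hypothetical choices made for illustration only.

```python
from typing import Dict


def parse_histogram(text: str) -> Dict[int, int]:
    """Parse 'depth count' lines into a {depth: number of distinct k-mers} map."""
    hist: Dict[int, int] = {}
    for line in text.strip().splitlines():
        depth, count = line.split()
        hist[int(depth)] = int(count)
    return hist


def estimate_genome_size(hist: Dict[int, int], min_depth: int = 5) -> float:
    """Naive estimate: total k-mer occurrences divided by the main peak depth.

    Depths below `min_depth` are treated as sequencing errors and ignored
    (an assumption of this sketch, not a setting taken from the paper).
    """
    usable = {d: c for d, c in hist.items() if d >= min_depth}
    peak_depth = max(usable, key=usable.get)          # depth with most k-mers
    total_kmers = sum(d * c for d, c in usable.items())
    return total_kmers / peak_depth


# Toy histogram, not real data
toy = "1 1000\n2 200\n9 50\n10 100\n11 50\n20 10"
size = estimate_genome_size(parse_histogram(toy))
```

With real data, the estimate is sensitive to where the error cut‐off is placed and to heterozygous or repeat peaks, which is the reason a model‐based tool such as GCE was used here.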
2.6 Assessment of assembly and annotation
The completeness of the genome/transcriptome assembly and gene prediction was assessed using BUSCO (Simão, Waterhouse, Ioannidis, Kriventseva, & Zdobnov, 2015). BUSCO searches the target sequences for 843 genes that are highly conserved in Metazoa and evaluates the completeness as the ratio of genes found (the higher the ratio, the higher the completeness). We ran BUSCO for the H. pulcherrimus genome/transcriptome assembly and gene models predicted from each sequence and for the genome assembly of S. purpuratus as a comparison. For S. purpuratus, genome assembly v3.1 (Spur_3.1.LinearScaffold.fa) released by EchinoBase was used for the assessment.
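The completeness ratio is simple arithmetic over the 843‐gene reference set. The sketch below is not part of BUSCO itself; `busco_summary` is a hypothetical helper that reproduces the percentages from raw counts, using the H. pulcherrimus genome assembly counts reported in Table 2 as the example input.

```python
def busco_summary(complete: int, fragmented: int, total: int = 843) -> dict:
    """Convert BUSCO counts to percentages of the reference gene set."""
    return {
        "complete_pct": round(100 * complete / total, 2),
        "fragmented_pct": round(100 * fragmented / total, 2),
        "combined_pct": round(100 * (complete + fragmented) / total, 2),
    }


# Counts for HpulGenome_v1_scaffold.fa as listed in Table 2
genome = busco_summary(complete=662, fragmented=131)
```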
2.7 Gene model comparison and orthology prediction
To investigate how de novo gene models of H. pulcherrimus were supported by the other data and to infer those orthologs, we conducted reciprocal BLASTP searches of sequences between the genome‐derived and transcriptome‐derived gene models of H. pulcherrimus and S. purpuratus gene models. For S. purpuratus, peptide sequences, SPU_peptide.fasta, embedded in annotation database build 7 were used. An e‐value less than 1e‐10 was used as a cut‐off to define a BLAST hit. We also defined orthologous genes as pairs of genes that are mutually unique best hits in a reciprocal BLAST search.
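The orthology criterion described above (mutually unique best hits below the E‐value cut‐off) can be sketched as follows. This is an illustrative reimplementation, not the script used in the study. It assumes each BLAST hit has already been reduced to a `(query, subject, e-value, bit score)` tuple, ranks hits by bit score, and treats a tied top score as "not unique"; all of these are assumptions of the sketch.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Hit = Tuple[str, str, float, float]  # (query, subject, e-value, bit score)


def unique_best_hits(hits: Iterable[Hit],
                     max_evalue: float = 1e-10) -> Dict[str, Optional[str]]:
    """Map each query to its single best subject, or None if the top score is tied."""
    best: Dict[str, Tuple[float, List[str]]] = {}
    for query, subject, evalue, bitscore in hits:
        if evalue >= max_evalue:          # keep only hits below the cut-off
            continue
        top_score, subjects = best.get(query, (float("-inf"), []))
        if bitscore > top_score:
            best[query] = (bitscore, [subject])
        elif bitscore == top_score and subject not in subjects:
            subjects.append(subject)      # tie: best hit is no longer unique
    return {q: (s[0] if len(s) == 1 else None) for q, (_, s) in best.items()}


def reciprocal_best_hits(forward: Iterable[Hit], reverse: Iterable[Hit],
                         max_evalue: float = 1e-10) -> List[Tuple[str, str]]:
    """Return (a, b) pairs where a's unique best hit is b and b's is a."""
    fwd = unique_best_hits(forward, max_evalue)
    rev = unique_best_hits(reverse, max_evalue)
    return sorted((a, b) for a, b in fwd.items()
                  if b is not None and rev.get(b) == a)


# Toy hits with made-up identifiers and scores
forward = [("Hp1", "Sp1", 1e-50, 200.0), ("Hp1", "Sp2", 1e-20, 90.0),
           ("Hp2", "Sp2", 1e-30, 120.0), ("Hp3", "Sp3", 1e-5, 40.0)]
reverse = [("Sp1", "Hp1", 1e-50, 200.0), ("Sp2", "Hp1", 1e-25, 100.0),
           ("Sp2", "Hp2", 1e-20, 95.0)]
pairs = reciprocal_best_hits(forward, reverse)
```

In this toy input, only Hp1 and Sp1 pick each other; Hp2 has a one‐directional hit to Sp2, and Hp3 has no hit below the cut‐off.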
2.8 Maser and Genome Explorer (GE) for NGS data analysis and visualization
For the data analysis, we used the NGS data analysis platform, Maser, which is managed by the National Institute of Genetics, Japan (http://cell-innovation.nig.ac.jp/maser/). Maser is equipped with over 400 bioinformatic tools, including tools for the annotation of the de novo genome and transcriptome. We also used a genome browser, Genome Explorer (GE), to visualize the genome assembly and gene features. GE is a web‐based genome browser that was developed by the National Institute of Genetics, Japan, and implemented on Maser. GE can be used for gene searches by gene ID, name, or synonyms and for retrieval of sequences, and therefore can be used as a simple database. We used this function of GE to visualize the H. pulcherrimus genome and to construct its database.
2.9 Web browser requirements
To use HpBase, Google Chrome is recommended because GE works best with Google Chrome.
3 RESULTS AND DISCUSSION
3.1 Genome size estimation and sequence coverage
The size and sequence coverage of the H. pulcherrimus genome were estimated to be 807 Mbp and 103.9×, respectively, using GCE (Liu et al., 2013). This estimate is comparable to the 814 Mbp of S. purpuratus (Sodergren et al., 2006) but much larger than the total length (568.9 Mbp) of all genomic scaffolds of H. pulcherrimus (Table 1), implying that many reads remain unassembled, probably due to the high repeat content and heterogeneity of the H. pulcherrimus genome.
| | Number | Average length (bp) | N50 (kbp) | Total length (Mbp) | Number of gaps (n) |
|---|---|---|---|---|---|
| Genome | | | | | |
| Contig | 134,246 | 3,236 | 4.7 | 431.7 | — |
| Scaffold (v0.5) | 16,236 | 34,768 | 137.8 | 567.6 | 135,872,766 |
| Scaffold (v0.8) | 16,251 | 35,121 | 143.3 | 570.8 | 139,009,206 |
| Scaffold (v1.0) | 16,251 | 35,008 | 142.6 | 568.9 | 93,327,514 |
| Gene model | 24,860 | 1,685 | — | — | — |
| Transcriptome | | | | | |
| Contig | 124,330 | 832 | — | — | — |
| Gene model | 20,564 | 1,657 | — | — | — |
3.2 Genome assembly v0.5
The Illumina HiSeq 2000 platform generated 1,700,169,112 reads (100 bp each): 1,152,419,110 reads for the paired‐end library, 243,436,308 reads for the 3k mate‐pair library, and 304,313,694 reads for the 10k mate‐pair library. As a result of the first assembly with these reads, we obtained 134,246 contigs with an N50 length of 4,745 bp and 16,326 scaffolds with an N50 length of 134,805 bp (Table 1).
3.3 Genome assembly v0.8
We reconstructed scaffolds using additional sequences of H. pulcherrimus BAC clones (AG981701–AG992007), improving the assembly quality slightly: the number of scaffolds decreased to 16,251, and the N50 length was extended to 143,251 bp (Table 1).
3.4 Genome assembly v1.0 (HpulGenome_v1_scaffold.fa)
Finally, we filled gaps in the genome assembly by using the SOAPdenovo gap closure module (Luo et al., 2012). As a result, the number of gaps (denoted by “N”) in the genome assembly v0.8 (139,009,206) was reduced by 33% in v1.0 (93,327,514), and the scaffold N50 was shortened slightly (from 143,251 bp in v0.8 to 142,559 bp in v1.0) (Table 1). This is the latest version of the assembly of the H. pulcherrimus genome shown in HpBase.
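For readers who want to recompute the assembly statistics in Table 1 from a scaffold file, the two quantities discussed above can be sketched as below. The helpers are hypothetical and operate on sequences already loaded into memory; the sketch assumes that "number of gaps" counts individual N bases, which is how the text above describes it.

```python
from typing import Iterable, List


def n50(lengths: List[int]) -> int:
    """Length L such that sequences of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


def count_gap_bases(sequences: Iterable[str]) -> int:
    """Count N bases (upper or lower case) across all sequences."""
    return sum(seq.upper().count("N") for seq in sequences)


# Toy scaffolds, not real data
scaffolds = ["ACGTNNNNACGT", "nnACGT", "ACG"]
stats = (n50([len(s) for s in scaffolds]), count_gap_bases(scaffolds))
```

Gap filling replaces runs of N with sequence of a possibly different length, so a small change in scaffold N50 after gap closing, as seen between v0.8 and v1.0, is expected.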
3.5 Gene prediction based on genome assembly
MAKER2 (Holt & Yandell, 2011) predicted 25,147 de novo gene models based on the latest version of the genomic assembly, HpulGenome_v1 (Table 1). Of these, we removed mitochondrial genes and invalid gene models that were too short (less than 50 aa) or contained multiple stop codons. Finally, 24,860 gene models with an average length of 1,685 bp were regarded as genome‐derived gene models (Table 1). The resultant sequences and gene structure datasets are summarized in “HpulGenome_v1_nucl.fa” (nucleotide sequences in fasta format), “HpulGenome_v1_prot.fa” (protein sequences in fasta format), and “HpulGenome_v1.gff” (gene structure in general feature format). These datasets are deposited at the website for the H. pulcherrimus genome (see below). Based on the data in Echinobase (http://www.echinobase.org/) (Cameron, Samanta, Yuan, He, & Davidson, 2009; Kudtarkar & Cameron, 2017), there are 29,948 (June, 2017) total annotated genes of S. purpuratus and 22,105 total annotated genes of L. variegatus, indicating that the number of predicted genes in H. pulcherrimus is reasonable as data from the first assembled genome.
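The filtering step can be sketched as a simple predicate over translated gene models. The exact rules used in the study are not spelled out beyond the description above, so this sketch makes two assumptions: stop codons appear as `*` in the protein string, and "multiple stop codons" means more than one `*`. Mitochondrial genes are handled by an explicit exclusion set, since the text does not say how they were identified.

```python
from typing import Dict, Iterable


def is_valid_model(protein: str, min_length: int = 50, max_stops: int = 1) -> bool:
    """True if the model has >= min_length residues and at most max_stops '*'."""
    residues = protein.replace("*", "")
    return len(residues) >= min_length and protein.count("*") <= max_stops


def filter_models(models: Dict[str, str],
                  excluded_ids: Iterable[str] = ()) -> Dict[str, str]:
    """Drop invalid models and any IDs listed in excluded_ids (e.g. mitochondrial)."""
    excluded = set(excluded_ids)
    return {gene_id: seq for gene_id, seq in models.items()
            if gene_id not in excluded and is_valid_model(seq)}


# Toy models with made-up identifiers
models = {
    "g1": "M" + "A" * 60 + "*",                   # kept
    "g2": "M" + "A" * 10,                         # too short
    "g3": "M" + "A" * 30 + "*" + "A" * 30 + "*",  # two stop codons
    "g4": "M" + "K" * 80 + "*",                   # excluded by ID
}
kept = filter_models(models, excluded_ids=["g4"])
```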
3.6 Transcriptome assembly (HpulTranscriptome)
A total of 53,376,104 RNA‐seq reads (100 bp each) were assembled into 124,330 contigs with an average length of 832.2 bp and an N50 length of 1,541 bp. These sequences are included in the dataset “HpulTranscriptome.fa” in HpBase. Based on the transcriptome assembly, 32,169 ORFs were identified by TransDecoder 2.0.1 (Haas et al., 2013). After the removal of highly redundant sequences (95% sequence similarity to each other), 20,564 gene models with an average length of 1,657 bp remained as transcriptome‐derived ORFs (Table 1) (deposited in HpBase as “HpulTranscriptome_nucl.fa” and “HpulTranscriptome_prot.fa”). In this study, we mixed total RNAs equally from four stages, maternal (2 hr), early blastula (14 hr), mesenchyme blastula (21 hr), and prism larva (43 hr), and assembled the reads. Thus, the data here lack the stage‐specific information and are not sufficient to cover all developmental events. However, combined with the gene prediction from the genome data, this transcriptome becomes a powerful tool in H. pulcherrimus biology because it contains the real ORFs (see below) and UTR regions, which are essential for designing knockdown experiments and/or for making multiple in situ hybridization probes. Together with the previously reported gastrula EST data (Shoguchi, Tokuoka, & Kominami, 2002), most of the embryonic transcripts should be covered and will be available in HpBase.
3.7 Assessment of assembly and gene prediction completeness
The completeness of the H. pulcherrimus genome and transcriptome assemblies and gene predictions was assessed using BUSCO (Simão et al., 2015) (Table 2). BUSCO returned high degrees of completion (85.8%–97% of core metazoan genes) in both genome and transcriptome datasets. The degree of completion is higher for transcriptome sequences (94.4%–97.2%) than for genome sequences (85.8%–94.1%). The completeness is highest in transcriptome‐derived ORFs, although the number of predicted ORFs (20,564) is smaller than that of genome‐derived gene models (24,860) (Table 1).
| Dataset | Complete single‐copy BUSCOs (number) | Complete single‐copy BUSCOs (%) | Fragmented BUSCOs (number) | Fragmented BUSCOs (%) | Complete single‐copy + fragmented BUSCOs (number) | Complete single‐copy + fragmented BUSCOs (%) |
|---|---|---|---|---|---|---|
| Hemicentrotus pulcherrimus | | | | | | |
| Genome assembly (HpulGenome_v1_scaffold.fa) | 662 | 78.53 | 131 | 15.54 | 793 | 94.07 |
| Gene model from genome assembly (HpulGenome_v1_prot.fa) | 511 | 60.62 | 211 | 25.03 | 722 | 85.65 |
| Transcriptome assembly (HpulTranscriptome.fa) | 679 | 80.55 | 122 | 14.47 | 801 | 95.02 |
| ORFs inferred from transcriptome assembly (HpulTranscriptome_prot.fa) | 707 | 83.87 | 112 | 13.29 | 819 | 97.15 |
| Strongylocentrotus purpuratus | | | | | | |
| Genome assembly v3.1 (Spur3.1_Linearscaffolds.fa) | 749 | 88.85 | 62 | 7.35 | 811 | 96.20 |
3.8 Comparison of gene models
To investigate how genome‐derived gene models were supported by transcriptome data, we conducted reciprocal BLASTP searches of genome‐derived gene models against transcriptome‐derived ORFs of H. pulcherrimus. As a result, 11,557 (46.5%) H. pulcherrimus genome‐derived gene models were matched to a single transcriptome‐derived gene model as a reciprocal best hit, and 13,123 (52.8%) were matched to at least one transcriptome‐derived gene model under the E‐value of 1e‐10 (Table 3). Only the remaining 180 (0.7%) had no similar gene, indicating that most of the genome‐derived gene models were supported by transcriptome data.
| Query | Subject | Reciprocal best hit pair (number of genes) | Best hit under 1e‐10 (number of genes) | No hit (number of genes) |
|---|---|---|---|---|
| H. pulcherrimus gene model from genome assembly (HpulGenome_v1_prot.fa) | H. pulcherrimus gene model from transcriptome assembly (HpulTranscriptome_prot.fa) | 11,557 (46.5%) | 13,123 (52.8%) | 180 (0.7%) |
| H. pulcherrimus gene model from genome assembly (HpulGenome_v1_prot.fa) | S. purpuratus gene model (SPU_peptide.fa) | 13,772 (55.4%) | 10,877 (43.8%) | 211 (0.8%) |
Next, to infer gene orthology relationships, we conducted reciprocal BLASTP searches of gene models between H. pulcherrimus and S. purpuratus (Table 3). Among the 24,860 H. pulcherrimus gene models, 13,772 (55.4%) were matched to a single S. purpuratus gene model as reciprocal and unique best hits; therefore, these gene pairs are highly likely to be orthologous. Of the remaining genes, 10,877 (43.8%) were matched to at least one S. purpuratus gene model with an E‐value of 1e‐10 or less, while 211 (0.8%) had no similar gene among the S. purpuratus gene models. Based on these analyses, the genes predicted in this study using Maser are of sufficient quality for developmental, genetic, cell, and evolutionary biology.
3.9 Gene name annotation
Based on the BLASTP search results against S. purpuratus genes, we assigned gene names to de novo gene models of H. pulcherrimus according to the following criteria. (i) If the gene models of H. pulcherrimus and S. purpuratus matched each other as a single unique best hit, then we gave the same name as the S. purpuratus gene to the H. pulcherrimus gene model (e.g., Sp‐FoxQ2 → Hp‐FoxQ2). (ii) If the H. pulcherrimus gene model matched any S. purpuratus gene with an E‐value of 1e‐10 or less but not as a reciprocal best hit, we added “‐like” to the end of the gene name (e.g., Sp‐FoxQ2 → Hp‐FoxQ2‐like). However, under both criteria (i) and (ii), there are some gene models for which the top hit subject of the S. purpuratus gene is not well annotated (namely, the gene name is not assigned and is denoted by “none”). Such gene models were named “hypothetical protein”, as were those that satisfied the third criterion. (iii) The remaining H. pulcherrimus gene models that do not have any similar gene in S. purpuratus were named “hypothetical protein” (e.g., Hp‐Hypp_0001). We also conducted BLASTP searches of H. pulcherrimus gene models against the UniProt database and NCBI nucleotide collections to help annotate the gene models. From these BLAST search results, the top hit subject (E‐value < 1e‐10) was retrieved. The results of gene name annotation and BLAST searches are listed in “HpulGenome_v1_annot.txt”.
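The three naming criteria can be summarized as a small decision function. This is a sketch of the rules as stated above, not the code used to build the annotation file; `assign_gene_name` and its arguments are hypothetical, and ASCII hyphens are used in the identifiers for simplicity.

```python
from typing import Optional


def assign_gene_name(sp_name: Optional[str], is_reciprocal_best: bool) -> str:
    """Derive an H. pulcherrimus gene name from its top S. purpuratus hit.

    sp_name is None when no hit passed the E-value cut-off (criterion iii)
    and "none" when the top hit itself has no assigned name.
    """
    if sp_name is None or sp_name == "none":
        return "hypothetical protein"
    # Swap the species prefix: Sp-FoxQ2 -> Hp-FoxQ2
    hp_name = "Hp-" + sp_name[3:] if sp_name.startswith("Sp-") else sp_name
    # Criterion (i): reciprocal unique best hit keeps the name as is;
    # criterion (ii): any other hit gets the "-like" suffix.
    return hp_name if is_reciprocal_best else hp_name + "-like"


names = [
    assign_gene_name("Sp-FoxQ2", True),
    assign_gene_name("Sp-FoxQ2", False),
    assign_gene_name("none", True),
    assign_gene_name(None, False),
]
```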
3.10 Visualization of H. pulcherrimus genome on Maser
The genome assembly HpulGenome_v1 and gene structures were visualized on a genome browser, Genome Explorer (GE), implemented on Maser (Figure 1). We embedded the information of gene annotation listed in “HpulGenome_v1_annot.txt”; therefore, newly assigned gene IDs and names of H. pulcherrimus can be used as keywords for gene searches. Additionally, gene IDs and names of S. purpuratus can be used for gene searches, showing the top hit gene in BLAST for H. pulcherrimus recorded in the annotation file.

3.11 HpBase web site construction on Public Maser
Finally, we created a website, HpBase, for public release of the aforementioned datasets and genome viewer of H. pulcherrimus (Figure 2). This website provides a link to GE and direct download links to the datasets of genome assembly, gene models, and annotation. The following six links are shown on the top page of HpBase:

- Home: A link to go back to the home page of HpBase (Figure 2a).
- Gene Search: A link to go to the gene search page, on which people can search for the specific gene of interest by using the gene ID, gene name, synonyms, transcriptome ID, or SPU gene ID (Figure 2b).
- Homology Search: People can use a BLAST search to isolate sequence information of interest. Blastn, blastp, blastx, tblastn, and tblastx are available as BLAST programs. The database lists the whole genome scaffold, nucleotide and protein sequences for annotated genes from the genome and transcriptome data. This page is equipped with several parameters for more detailed analyses (Figure 2c).
- Genome Viewer: A link to the Genome Explorer (Figure 2d).
- Data Download: Raw data files for genome assembly, gene models, transcriptome, and annotation files are downloadable (Figure 2e).
- Protocols: This page contains PDFs showing a brief protocol for the experiments in each laboratory. All laboratories using H. pulcherrimus or other organisms are welcome to upload their protocols to this page through submission and a brief review by the database administrator. The contents of this page will help researchers using H. pulcherrimus by providing experimental information based on the experiences of each scientist and/or each laboratory. For example, “Reagents for microinjection” shows how to prepare the reagents for microinjection into the eggs of H. pulcherrimus. Based on the published data for H. pulcherrimus, the effective concentrations of morpholino oligonucleotides (MO) to knock down genes have so far ranged between 100 μmol/L (e.g., Nodal‐MO1) (Yaguchi et al., 2010) and 3.8 mmol/L (e.g., Wnt7‐MO1) (Yaguchi, Takeda, Inaba, & Yaguchi, 2016). The experimental concentrations shown in the protocol produced healthy morphants. To demonstrate the specificity of a morpholino, the first choice should be a second, non‐overlapping morpholino. The second choice is to use a different species of sea urchin to confirm the same phenotype. The third choice is to use a control morpholino, but the universal control‐MO, GFP‐MO, and randomized‐MO obtained from Gene Tools cannot be used at high concentrations for H. pulcherrimus because 3.8 mmol/L aliquots of these morpholinos killed the embryos for unknown reasons, whereas some morpholinos, such as the Wnt7‐MO1 mentioned above, produced healthy embryos at the same concentration (Figure 2f).
4 CONCLUSION
Since the genome of S. purpuratus was read in 2006 (Sodergren et al., 2006), most sea urchin biology has relied on S. purpuratus genome data stored in Echinobase, even when researchers use different species such as H. pulcherrimus or P. lividus. The data were so powerful that the experimental speed for cloning interesting genes or genomic fragments has been dramatically improved. However, differences between species affect detailed information such as the UTR sequences in mRNA or the non‐coding sequences in the genome, especially when Asian or European species are used. Thus, the data shown here and stored in HpBase will help researchers using H. pulcherrimus in their detailed analyses in the fields of cell, developmental, and evolutionary biology.
HpBase contains a number of tools useful for studies using sea urchins. In particular, “Protocols” is a new challenge because researchers can easily make their own protocols available to the public and refer to other labs' protocols. Since no strict review of each protocol will be performed when it is deposited, researchers must judge whether the information on the list is reliable or applicable for their own experiments. Nevertheless, it will be very useful and powerful information for experimental biology.
The descendants of the male whose sperm was used for genomic sequencing are kept in two independent marine stations, Shimoda Marine Research Center, University of Tsukuba, and Marine and Coastal Research Center, Ochanomizu University. These sea urchins are available by request as a reference for genomic studies, and we also intend to make inbred strains and will read the genome in the future.
5 AVAILABILITY OF DATA
The sequence data were submitted to DDBJ/EMBL/GenBank databases (BioProject Accession PRJDB6441). The accessions of the sequences of genome assembly HpulGenome_v1 and transcriptome assembly HpulTranscriptome are BEXV01000001–BEXV01016251 and IACU01000001–IACU01124330, respectively. The raw sequencing data of whole genome sequencing and RNA‐seq are deposited in the DDBJ Sequence Read Archive (DRA), with Accessions DRA006268 and DRA006267, respectively.
ACKNOWLEDGMENTS
We thank Masafumi Muraoka, Kazutoshi Yoshitake, Sadahiko Misu, and Norikazu Monma for their help in the analyses and HpBase construction. This research was partially supported by the Platform Project for Supporting Drug Discovery and Life Science Research from the Japanese Agency for Medical Research and Development (AMED) (to KI and SK) and a project of the Joint Usage/Educational Center (to MK) and Grant‐in‐Aid for Scientific Research (B) (26290070) (to TY and SY), (C) (25440101) (to SY), and Grant‐in‐Aid for Young Scientists (B) (23770241) (to SY) by the Ministry of Education, Culture, Sports, Science & Technology in Japan.




