Assembly of continuous high-resolution draft genome sequence of Hemicentrotus pulcherrimus using long-read sequencing

The update of the draft genome assembly of sea urchin, Hemicentrotus pulcherrimus, which is widely studied in East Asia as a model organism of early development, was performed using Oxford nanopore long-read sequencing. The updated assembly provided ∼600 Mb genome sequences divided into 2,163 contigs with N50 = 516 kb. BUSCO completeness score and transcriptome model mapping ratio (TMMR) of the present assembly were obtained as 96.5 % and 77.8 % respectively. These results were more continuous with higher resolution than those by the previous version of H. pulcerrimus draft genome, HpulGenome_v1, where the number of scaffolds = 16,251 with a total of ∼100 Mb gaps, N50 = 143 kb, BUSCO completeness score = 86.1 %, and TMMR = 40.61. The obtained genome contained 36,055 gene models that were consistent with those in other echinoderms. Additionally, two tandem repeat sequences of early histone gene locus containing 47 copies and 34 copies of all histone genes, and 185 of the homologous sequences of the interspecifically conserved region of the Ars insulator, ArsInsC, were obtained. These results provide further advance for genome-wide research of development, gene regulation, and intranuclear structural dynamics of multicellular organisms using Hemicentrotus pulcherrimus.


| INTRODUCTION
Sea urchins are model organisms commonly used for studying early development and morphogenesis due to their evolutionary position as diverged at the early period of deuterostome evolution.For example, Hemicentrotus pulcherrimus, which is one of the typical sea urchins in East Asia, has been widely investigated to reveal developmental systems of multicellular organisms such as intranuclear chromatin structural dynamics, establishment of left-right asymmetric body axis, and nervous system formation (Matsushita et al., 2017;Takemoto et al., 2016;Yaguchi & Katow, 2003).The gene regulatory network controlling endomesoderm specification in sea urchin embryos was also extensively studied using Strongylocentrotus purpuratus (Davidson et al., 2002;Oliveri & Davidson, 2004).
The detailed features of genome sequences of various organisms promote research for a variety of universal and specific activities of entire living systems.Therefore, similar to various organisms, sea urchins draft genomes have also been assembled; for example, S. purputarus in 2006 (Sodergren et al., 2006) (the latest version was updated in 2019), H. pulcherrimus in 2018 (Kinjo et al., 2018), Lytechinus variegatus in 2020 (Davidson et al., 2020), Temnopleurus reevesii in 2022 (Kinjo et al., 2022), and Paracentrotus lividus in 2023 (Marletaz et al., 2023).
Recent research using the Hi-C method and its derivatives (Lieberman-Aiden et al., 2009), which has progressed rapidly in the last decade, has revealed that the expression of each gene is affected by various cis-regulatory elements and their higher-ordered structures such as enhancer-promoter loops, nucleosome exclusive nonlooping insulator sequences (NENLIS) (Matsushima et al., 2019), topologically associated domains (TADs), A/B compartments, etc. (Lieberman-Aiden et al., 2009).
To analyze the effects of such cis-regulatory structures at various scales ranging from several kb to Mb, a highly contiguous genomic sequence longer than several Mb of the target organism is required.
However, it is difficult to construct such a long contiguous scaffold that contains a transcribed gene region and its cis-regulatory elements such as enhancers and insulators with conventional shortread-based genome assembly.For example, the previously reported draft genome sequence of H. pulcherrimus, called HpulGenome_v1, did not contain the Ars insulator sequence located 2 kb upstream of the arylsulfatase (Hp-Ars) gene (Akasaka et al., 1999;Takagi et al., 2012), which has an interspecies-conserved AT-rich core region (ArsInsC) known as a typical NENLIS (Matsushima et al., 2019).Furthermore, the Hp-Ars gene-containing scaffold in the HpulGenome_v1 did not include another important regulatory region, the C15 enhancer (Iuchi et al., 1995;Sakamoto et al., 1997).Since repetitive sequences are dispersed throughout the genome, the presence of repeat sequences leads to difficulty in the short-read-based genome assembly.
In addition, it was impossible to reconstruct long repeat sequences far exceeding several 100 bp, which frequently appear in multicellular organism genomes; therefore, many gaps had to exist in the scaffold.Thus, although it has been suggested that the H. pulcherrimus genome contains long repetitive sequences consisting of 10-100 of tandem sequences of genes encoding histones H1, H2A, H2B, H3, and H4, HpulGenome_v1 does not contain such long repeat regions.However, the above problems are being overcome by technological advances in long-read sequencing, which yields sequences of 10-100 kb per read.At present, reads obtained by long-read sequencing appear to contain errors 100 times more frequently than those obtained by short-read sequencing (Logsdon et al., 2020;Rang et al., 2018).However, the hybrid assembly method, which assembles short reads from the same organism while correcting errors in long reads, may enable highly accurate genome assembly with long continuity by taking advantage of both (Holley et al., 2021).
Therefore, in this study, the H. pulcherrimus draft genome sequence was updated by hybrid assembly using both long reads obtained using the Oxford Nanopore sequencer and short reads used in conventional H. pulcherrimus draft genome assembly.In this paper, we compared the updated draft genome sequence with HpulGen-ome_v1, and provided the annotation of genes by comparison with related species.We also show that the updated draft genome contains the ArsInsC sequence at upstream of the Hp-Ars gene and its homologous sequences in the updated genome.Furthermore, we identified two distinct long tandem repeats of early histone genes.
Such improvements of the draft genome sequence will drive further progress of studies for physiological and structural-dynamical features of gene regulatory behaviors in H. pulcherrimus.

| Sample collection and DNA extraction
Adult males of H. pulcherrimus were collected from the sea around Etajima City, Hiroshima Prefecture, Japan, with permission of the Hiroshima Fishery Cooperative, and the genomic DNA (gDNA) was extracted from the sperm of three adult individuals.The H. pulcherrimus genomic DNA was extracted by using the Blood & Cell Culture DNA Midi Kit (QIAGEN), and short genomic DNA segments were removed by the Short Read Elimination XL Kit (Circulomics Inc.).
To obtain extremely long reads, we followed the protocol described by Logsdon et al. (2020).

| Sequencing library preparation
Nanopore sequencing libraries were individually prepared from gDNA derived from each sample using the Rapid Sequencing Kit (SQK-RAD004; Oxford Nanopore Technologies [ONT]).Nanopore sequencing were performed on an ONT MinION sequencer using R9.4.1 flow cells and FAST5 files were basecalled to FASTQ files using ONT MinIT (MNT-001).In the following, each of the generated FASTQ formatted reads was named ONT-read.
Error correction of ONT read data derived from each library was individually performed using Ratatosk v0.7.0 (Holley et al., 2021) with preprocessed Illumina short reads (160.47Gb in total).

| Assessment of draft genome sequences
To evaluate the completeness of the genome assembly, highly conserved genes in metazoan were searched using BUSCO v5.3 (Manni et al., 2021).As an additional analysis, predicted transcriptome models (nonredundant open reading frames [ORFs] derived from transcriptome assembly in HpBase, HpulTranscriptome_nucl.fa) were mapped to each draft genome sequence using hisat2 v2.2.1 (Kim et al., 2019).

| Genome assembly and completeness of draft genome sequence
Sequencing by the ONT MinION sequencer generated 17.29 million reads with 51.54 Gb of sequenced data.By the error correction and filtering out ONT reads, 2.76 million reads with 35.26 Gb were generated for genome assembly.The draft genome sequence was updated by assembling and polishing these reads by Raven, Flye, Wtdbg2, and MAECI (Table 1, Table S1).Here, the contig (contig ID: Utg196084) that was most homologous to the mitochondrial genome of H. pulcherrimus [NC_023771.1] by BLASTN was removed.The updated draft genome sequence consists of 2,163 contigs with a total length of $626.4Mb and N50 length of $515.7 kb (Table S1), with a smaller number of contigs and more continuously assembled than HpulGenome_v1; HpulGenome_v1 consists of $16,000 scaffolds (contigs) with total length of $600 Mb containing a total of $100 Mb gaps and N50 = 142.6 kb (Table S1).
Analysis of genome completeness of the updated draft genome by BUSCO using the metazoa_odb10 showed a larger value of 96.5% (89.9% single copy and 6.6 duplicated) than that of HpulGenome_v1 (86.1% [84.8% single copy and 1.3 duplicated]).Additionally, the mapping ratio of transcriptome models of the present draft genome also obtained a larger value of 77.78% than that of HpulGen-ome_v1 (55.35%).

| Gene prediction based on genome assembly and gene annotation with related species
The analysis by BRAKER2 and transcriptome data (DRR107783) predicted 46,914 genes exist in the updated draft genome sequence.By eliminating the predicted genes that were too short (less than 50 amino acids), contained multiple stop codons, or were encoded on the mitochondrial contig, 46,826 gene models were obtained in the updated draft genome, with 5.85 exons per gene model and an average coding sequence (CDS) length of 1,249.23 bp (Table S2).The inference from the orthology relationships between the present gene models and gene models in other sea urchins through reciprocal BLASTP searches showed that 36,055 gene models (number of reciprocal best hit pair, 20,434 and number of not reciprocal but best hit, T A B L E 1 Summary of genome assembly, BUSCO completeness and gene annotation in HpulGenome_v1 and our draft genome sequence.2 and S2).

| Detection of early histone gene loci with long tandem repeats
The H. pulcherrimus genome has been suggested to contain two nonallelic early histone gene loci that are the long repetitive sequences consisting of 10-100 of tandem sequences of genes encoding histones (Table S3) (Matsushita et al., 2017).However, such long tandem repeat sequences could not be detected by recent short-read-based genome assemblies; in HpulGenome_v1, BLASTN searches did not find any genomic regions homologous even to single repeat sequences of early histone loci.
In contrast, two contigs containing the regions of long tandem repeats of histone genes were found on the updated draft genome sequence; where one contig (contig ID: Utg198178, with 479,333 bp) involved 47 tandem repeats, and the other contig (contig ID: Utg200276, with 127,611 bp) involved 34 tandem repeats (Table S3).

| Regulatory sequences of arylsulfatase gene
We have investigated the regulatory mechanism of the transcription of the arylsulfatase (Hp-Ars) gene.The promoter (À252 to +38) is the minimum region required for temporal expression (Iuchi et al., 1995;Morokuma et al., 1997).The C15 enhancer required for large enhancement of the expression is present in the first intron (Iuchi et al., 1995;Sakamoto et al., 1997).The polypyrimidine sequence (À2,201 to À1,680) that can form an intramolecular triplex structure (Sakamoto et al., 1996;Yamamoto et al., 1994) and the Ars insulator (À2,686 to À2,109) (Akasaka et al., 1999) are present in the upstream region.
In HpulGenome_v1, scaffold989 contains the the Hp-Ars gene.A BLASTN search revealed that this scaffold includes all exons, the promoter, and upstream sequences to the polypyrimidine region.
However, it does not include the C15 enhancer in the first intron and the Ars insulator in the upstream flanking region, which play important roles in the transcriptional regulation.
On the other hand, in the updated draft genome, the Hp-Ars gene-containing contig (contig ID: Utg200732) contains all exons and regulatory sequences including Ars insulator and C15 enhancer.Furthermore, the Hp-Ars gene has been reported to include various types of repetitive sequences such as two direct repeats (DIR, which are called DIR1and DIR2, respectively) and an inverted repeat (INV) in the upstream region (Akasaka et al., 1994).All of these repetitive sequences could be found in contig Utg200732.Interestingly, when we searched homologous sequences of the inverted repeat (À475 to À217) with 90-120 bp of the unit length, >90% identity and >90% query coverage among inverted repeats were obtained from IRF v3.08 (Warburton et al., 2004), and nine homologous inverted repeat sequences were found in the updated draft genome (Table S4).
Although seven out of these homologous inverted repeats were present in intergenic regions, there was no tendency in their position.
Furthermore, eight homologous sequences of the direct repeat DIR1 (À3,440 to À3,109) with two repeat units, >90% identity, and >95% query coverage were found, and 205 homologous sequences of another direct repeat DIR2 (À3,096 to À2,592) with two to five repeat units, >90% identity, and >95% query coverage were found in the updated draft genome (Table S5).These DIR1 and DIR2 homologs tend to exist in the upstream region of the nearest genes (Figure 1).

| Homologous genome region to Ars insulator
The Hp-Ars gene has been known to involve the Ars insulator, which includes the interspecifically conserved ArsInsC sequence (Akasaka et al., 1999;Takagi et al., 2012) (Table S4) that is a typical example of NENLIS (Matsushima et al., 2019).In HpulGenome_v1, the Hp-Ars gene was present, but the ArsInsC sequence was not present.In contrast, the updated draft genome contained ArsInsC from 46,417 to T A B L E 2 Summary of BLASTP searches with protein models of HpBase and other sea urchin.*Proteins fasta file of Paracentrotus lividus was created using gffread v0.12.7 (Pertea & Pertea, 2020).The genome and annotation in GTF format of P. lividus were downloaded from Zenodo (https://zenodo.org/record/7459274).
46,236 (minus strand) in contig ID: Utg200732, which is 2-kb upstream of the Hp-Ars gene region.ArsInsC was originally identified by interspecific sequence comparison with 75% identity in the 30 bp of the sliding window and does not have specific motifs except its interspecies-conserved AT-rich sequence (Takagi et al., 2012).Therefore, to find candidate sequences for functional homologs of ArsInsC, nucleotide sequences that have more than 90% identity to ArsInsC were searched.BLASTN analysis found 185 sequences that have more than 90% identity to ArsInsC (Table S6).As observed in the Ars insulator, the 50-bp regions surrounding the ArsInsC homologous sequences tended to have high GC contents (50%-80% GC) (Figures 2 and S1), and 121 homologous sequences (65.4%) contained a G(C)-stretch of more than 8 bp in these GC-rich regions (Figure S1).
Thus, the sequence properties of the Ars insulator are well conserved in these homologous sequences, suggesting that these homologous sequences may function as insulators in the sea urchin genome.

| Short tandem repeat
It has been reported that short tandem repeats (STRs) are enriched in gene promoters and variable repeat number is associated with gene expression (Sawaya et al., 2013;Vinces et al., 2009).Furthermore, approximately 90% of transcription factors preferentially bind to STRs (Horton et al., 2023).Therefore, we also analyzed the presence of STRs in the updated draft genome using TRF v4.09 (Benson, 1999).
According to the definition of microsatellite by Sawaya et al. (2013), uninterrupted consecutively repeated units of one to four nucleotides that are at least 12 bp in length were searched.After searching STRs that obtained from TRF results and matched this definition, a total of 111,553 mononucleotide repeats, 66,665 dinucleotide repeats, 61,983 trinucleotide repeats, and 62,448 tetranucleotide repeats were found in the updated draft genome (Table S7).However, the enrichment of STRs in the proximity of genes or the translation start sites was not observed.

| Microsynteny plots
Microsyntenies were plotted for the five fragment sets of H. pulcherrimus contig and S. purpuratus scaffold in order of having the most reciprocal BLASTP best hit pairs (Figure S2).While genome-wide synteny plots showed little interchromosomal gene order rearrangement (data not shown), microsynteny plots showed intrachromosomal gene order rearrangement, similar to the results of genome-scale synteny plots in other sea urchins (Marletaz et al., 2023).

| DISCUSSION
Long-read sequencing and hybrid assembly using long and short reads were performed to update the H. pulcherrimus draft genome with more continuous and higher accuracy than the recently reported HpulGen-ome_v1.This study resulted in a draft genome of H. pulcherrimus with higher values of the metrics by BUSCO and higher mapping rate of the transcriptome model than HpulGenome_v1.
This draft genome, involving 46,826 gene models, is larger than HpulGenome_v1 containing 24,860 model genes.Here, 36,055 of these gene models were annotated based on the analysis with the homologous gene models of related sea urchins.
The present draft genome contained two long repeat sequences, including the tandemly connected early histone genes.This result may drive the progress of genomic and epigenomic analyses to reveal the developmental stage-dependent changes in intranuclear location and interactions of early histone loci (Matsushita et al., 2017).
Additionally, this draft genome also contained ArsInsC sequence and more than 100 ArsInsC homologous sequences.Furthermore, the number of ArsInsC homologous sequences with more than 75% identity in the updated draft genome was 241.In vertebrates, the CTCFbinding sequence is known as a typical insulator sequence responsible for gene expression control.Recently, however, it has been suggested that sea urchin CTCF functions during mitosis rather than interphase (Watanabe et al., 2023).Therefore, ArsInsC and its homologous sequences may play an important role in the control mechanism of sea urchin interphase gene expression, and the present draft genome may promote research to clarify the mechanism.
The updated draft genome involved more than 10,000 nonannotated genes.These genes were expected to include In this study, HpulGenome_v1 was updated using Oxford Nanopore long-read sequencing, resulting in a smaller number of contigs with more continuous assembly.Since each contig was expected to contain a complete set of cis-regulatory sequences, we believe that this updated draft genome improves the applicability of H. pulcherrimus genome information to various genome-wide analyses.
For genome-wide analysis of H. pulcherrimus, a genomic sequence that can be widely used is required.Therefore, to perform genomewide analysis reflecting the degree of polymorphism in H. pulcherrimus, we used genomic DNA isolated from multiple individuals of H. pulcherrimus collected in Hiroshima.After error correction and genome assembly, there were 1,625 mixed bases on the updated draft genome sequence (Table S9), and the draft genome size was restrained to $600 Mb, although the estimated genome size is $800 Mb (Kinjo et al., 2018).We expect that excessive redundancy is suppressed, and the updated draft genome represents polymorphism in H. pulcherrimus.On the other hand, genes determined to be duplicated in BUSCO and the contigs containing them are likely to be redundant.However, we did not remove these contigs because they may contain important information such as polymorphic regions when editing the genome of H. pulcherrimus.In addition, it is difficult to perform quantitative haplotype-resolved analysis based on k-mer distribution with raw reads obtained from the Oxford Nanopore sequencer because of the high error rate of sequencing by this sequencer.Thus, we plan to correct the redundancy of draft genome sequences by, for example, using the Hi-C method, in the future.

1
Stacked histogram of genomic distances between DIR1 homologs and their nearest gene (a), and that between DIR2 homologs and their nearest gene (b).The height of red and blue bars indicates counts of direct repeats (DIR) homologs at the upstream and downstream region of the nearest gene, respectively.Distances between DIR homolog and its nearest gene was defined as the genomic distance between the centers of them.F I G U R E 2 Heat map of the number of ArsInsC homologs with each GC content in the upstream 50 bp and downstream 50 bp regions.The histograms represent the projection onto the upstream GC content axis and the downstream GC content axis, respectively.1.0.2008-2015, http://www.repeatmasker.org) and RepeatMasker v4.1.5(Smit, AFA, Hubley, R & Green, P. RepeatMasker Open-4.0.2013-2015, http://www.repeatmasker.org) for the updated draft genome of H. pulcherrimus, and the genomes of S. purpuratus, L. variegatus, and P. lividus, respectively.Most TEs were not annotated as known elements, but the classification of TEs and their proportions indicated that the trends of TEs in the updated draft genome sequence were closer to those in other sea urchins than in HpulGen-ome_v1 (Table noncoding RNA-derived or H. pulcherrimus-specific genes.The annotation of these gene models should be progressed by the comparison among genes in various organisms and the function analysis by experiments in the future.Additionally, a large number of duplicated genes were found by analysis with BUSCO in the present draft genome compared with HpulGenome_v1.The reason for this was not clear, nor whether each duplication is true in H. pulcherrimus.Further validation should be a future research direction.Sawaya et al. (2013) suggested that enrichment of STRs in promoters by calculating distances from transcription start site to STR.However, since gene model estimation based on mapping results of bulk RNA-sequencing did not show the transcription start site and the distribution of the 5 0 -UTR length in the H. pulcherrimus genome, we analyzed the distribution of distances from translation start sites to STRs.Although the enrichment of STRs in the proximity of genes and translation start site was not observed in this study, it is necessary to examine the enrichment of STRs in gene promoters after the estimation of transcription start site from other sequencing methods such as CAGE-sequencing or Iso-sequencing.