Near‐complete genome assembly and annotation of the yellow drum (Nibea albiflora) provide insights into population and evolutionary characteristics of this species

Abstract Yellow drum (Nibea albiflora) is an important fish species in capture fishery and aquaculture in East Asia. We herein report the first and near‐complete genome assembly of an ultra‐homologous gynogenic female yellow drum using Illumina short sequencing reads. In summary, a total of 154.2 Gb of raw reads were generated via whole‐genome sequencing and were assembled to 565.3 Mb genome with a contig N50 size of 50.3 kb and scaffold N50 size of 2.2 Mb (BUSCO completeness of 97.7%), accounting for 97.3%–98.6% of the estimated genome size of this fish. We further identified 22,448 genes using combined methods of ab initio prediction, RNAseq annotation, and protein homology searching, of which 21,614 (96.3%) were functionally annotated in NCBI nr, trEMBL, SwissProt, and KOG databases. We also investigated the nucleotide diversity (around 1/390) of aquacultured individuals and found the genetic diversity of the aquacultured population decreased due to inbreeding. Evolutionary analyses illustrated significantly expanded and extracted gene families, such as myosin and sodium: neurotransmitter symporter (SNF), could help explain swimming motility of yellow drum. The presented genome will be an important resource for future studies on population genetics, conservation, understanding of evolutionary history and genetic breeding of the yellow drum and other Nibea species.

family ranking first in annual production among all marine fish aquaculture in China (Guo and Zhao, 2017)), account for an important part of sea food consumption in China, especially in east China. Yellow drum shares similar external characteristics, meat quality and flavors with large yellow croaker. Now, yellow drum is expected to partially replace the market of the large yellow croaker, which suffers from the shortage of germplasm resource due to overfishing and severe bacterial and virus diseases in aquaculture (Chen, Lin, & Wang, 2003;Han et al., 2016). Recently, the wild population of the yellow drum is declining due to overfishing and aquaculture of this fish is expanding to meet market needs (Cheng, Xu, Jin, & Wang, 2011). It is worthwhile to conduct researches on wild population conservation and genetic improvement of this fish. The genus Nibea consists of eleven recognized species, widely distributed in the Indo-West Pacific oceans (Lo et al., 2015;Lo, Liu, Nor, & Chen, 2017). Up to date, no reference genome for the Nibea species is present, hampering the studies on conservation and genetic investigation of these species.
In the present study, we took efforts to assemble the first and near-complete reference genome of an ultra-homologous gynogenic female yellow drum using Illumina short sequencing reads. Based on the assembled genome, the genome-wide nucleotide diversity of cultured yellow drum has been investigated, which can reflect the degree of inbreeding of aquaculture populations. Besides, the phylogenetic relationships of yellow drum with other teleosts were previously described (Lo et al., 2015(Lo et al., , 2017; however, there is not a study conducted so far for phylogenetic analysis based on wholegenome data, and we thus inferred the phylogeny with the yellow drum and other teleosts. And analysis of expanded and contracted gene families can aid to investigate the evolution of specific characteristics of this fish. In conclusion, the near-complete reference genome of the yellow drum we provided laid a solid foundation for the conservation, evolutionary studies, and genetic improvement of this species.

| Sample, library construction, and sequencing
An ultra-homozygous female yellow drum homozygous at 11 highly polymorphic microsatellite loci (Supporting Information Table S1) was produced through two generations of gynogenesis in a mariculture farm in Ningde, Fujian province, China. Genomic DNA was extracted from its muscle tissue using the traditional phenol-chloroform isolation method (Sambrook & Russell, 2006). DNA concentration was measured using Qubit 2.0 Fluorometer (Life Technologies, CA, USA). During the whole-genome shotgun sequencing, four paired-end libraries (two with insert sizes of 300 bp and two with insert sizes 450 bp) were constructed using the Illumina TruSeq Nano method (Illumina, CA, USA). Three mate-pair libraries with long-insert sizes of 2 kb, 5 kb, and 10 kb were generated using the Nextera Mate Pair Sample Preparation Kit (Illumina, CA, USA). In brief, 4 μg of genomic DNA was fragmented and tagged with biotinylated mate-pair junction adapters. Subsequently, the strand displacement reaction was performed and the DNA was purified with AMPure XP beads. The 2 kb, 5 kb, and 10 kb DNA fragments were selected using a 0.75% cassette for the BluePippin (Sage Science, MA, USA). DNA was recovered and ligated overnight at 30°C following the manufacturer's protocol. After incubation and heat inactivation, exonuclease was added to remove non-circularized DNA. The samples were subsequently end repaired, A-tailed, and adapter ligated, and enrichment was achieved with 10 rounds of PCR. All prepared libraries were sequenced (150 bp × 2) on a HiSeq X platform (Illumina, USA) at Novogene Bioinformatics Technology Co., Ltd (Beijing, China). For quality control, FASTQC (Andrews, 2013) was used to check quality of the raw reads; Trimmomatic (Bolger, Lohse, & Usadel, 2014) was employed to trim the adapter sequences and to remove low-quality bases (Phred score <20) of the paired-end reads, and reads shorter than 50 bp were discarded. All reads from mate-pair libraries were trimmed to a length of 50 bp to avoid potentially spanning the junctions of the DNA circularization.

| Genome assembly and evaluation
Platanus v1.2.4 (Kajitani et al., 2014) was employed to assemble the genome sequence of N. albiflora through three steps: contig assembling, scaffold construction, and gap closing. Although Platanus was originally developed for assembling highly heterozygous genomes, it is also with better performance in constructing super large scaffolds compared to other assemblers, irrespective of heterozygosity of the genomes (Kajitani et al., 2014). First, all the paired-end sequencing data with short inserts were used to construct de Bruijn graphs with automatically optimized k-mer sizes. Then, the mate-pair sequencing reads were used to link contigs into scaffolds. At gap-closing step, all sequencing reads were utilized to fill intra-scaffold gaps using F I G U R E 1 The yellow drum (Nibea albiflora). The picture of the yellow drum was provided by Shuqiu Xie (Mindong Fishery Research Institute of Fujian Province) paired-end information, where one end of a pair uniquely mapped to a scaffold and the other end located within a gap.
The quality of the assembled genome was evaluated as follows: (a) The sequencing reads with short inserts (300 bp and 450 bp) were realigned to the assembled genome using BWA v0.7.17 (Li & Durbin, 2009). (b) RNAseq data of a pool of multiple tissues were aligned to the genome assembly using STAR v2.5.3a (Dobin et al., 2013). (c) BUSCO (Benchmarking Universal Single-Copy Orthologs) v3.0 (Simao, Waterhouse, Ioannidis, Kriventseva, & Zdobnov, 2015) was employed to assess the completeness of the assembly using database of actinopterygii_odb9.

| Repeat-content identification and classification
To identify known repeats and transposable elements (TEs), RepeatMasker v4.0.6 (Smit, Hubley, & Green, 2013 was used to align the genome assembly against the Repbase TE library with the default parameters (Jurka et al., 2005). In addition, a de novo repeat library was constructed using RepeatModeler v1.0.4 (Smit & Hubley, 2008 with the genomic sequences of N. albiflora.
The RepeatMasker was employed again to identify repeat elements using the de novo repeat library. The repeat elements identified from Repbase library and de novo library were merged together and masked for further analysis.

| Gene prediction and annotation
The repeat-masked genome was utilized to predict gene structures through ab initio gene prediction, protein homology-based prediction, and transcript evidence. albiflora genome assembly using tblastn (E-value: 1e−5) (Altschul et al., 1997). Aligned sequences were submitted to Exonerate v2.2.0 (Slater & Birney, 2005) for defining precise splicing sites and exons. In addition, transcripts assembled from RNAseq, which was sequenced on an Illumina HiSeq 2,500 platform (2 × 125 bp) with pooled library of 11 tissues (Han et al., 2018), were employed to build comprehensive transcripts database and identify open reading frame (ORF) using PASA v2.1.0 program (Haas et al., 2003). After removing the genes with coding region length shorter than 150 bp, a consensus gene set was created from the genes of the three different sources using EVM (Evidence Modeler) v1.1.1 (Haas et al., 2008). All predicted genes were then aligned to NCBI nr, trEMBL, SwissProt, and KOG databases for function annotation using blastx (E-value: 1e−5) (Camacho et al., 2009).

| Calculation of nucleotide diversity
The nucleotide diversity of yellow drum was measured using four randomly sampled fish (two males and two females) from the aqua- were generated on an Illumina HiSeq X platform. The resequencing data of the four individuals were sequenced at 32.8 ~ 38.4 × coverage and were aligned to the genome assembly of the yellow drum using BWA v0.7.17 (Li & Durbin, 2009). Duplicated reads were subsequently removed, and aligned reads around insertions/deletions (Indels) were realigned using GATK v3.8.0 (Mckenna et al., 2010).
Platypus v0.8.1 (Rimmer et al., 2014) was utilized to call SNPs and small Indels (≤10 bp) on the refined bam files. Nucleotide diversity was calculated by counting the frequency of heterozygous sites in high-quality variants (genotype quality > 60 and 10 < depth < 100) in each individual.

| Phylogeny and gene family comparison
The protein sequences of 11 species, including Homo sapiens were aligned with each other among the 12 species using ClustalW v2.0 (Larkin et al., 2007). Based on the alignment information, the OrthoMCL v2.0.9 (Li, Stoeckert, & Roos, 2003) was employed to obtain pairs of one-to-one protein-coding orthologs, and the orthologs were used to construct the phylogenetic tree using maximum likelihood estimation (MLE) method with MEGA7 (Kumar, Stecher, & Tamura, 2016). The divergence time for N. albiflora and other vertebrates was estimated based on the time-calibrated divergence between human and chicken (320 million years ago (MYA), obtained from the TimeTree database (Hedges, Dudley, & Kumar, 2006)).
In gene family analysis, protein sequences coded by the longest isoform of each gene from all the 12 species were aligned to each other exhaustively using blastp (E-value: 1e−5). Based on the alignments, gene clustering and gene copy number of a certain gene family within each species were obtained using OrthoMCL v2.0.9 (Li et al., 2003). And node time was estimated via r8s software (Sanderson, 2003). Finally, gene family expansion and

| Genome size estimation and assembling
We generated a total of 154.2 Gb of raw data (~260 × coverage) for the gynogenetic yellow drum on an Illumina HiSeq X platform. After quality trimming and filtering, we retained 119.8 Gb data for genome assembling (Table 1) Figure S1). The resulting histograms were explored to estimate the size, repeat content, and heterozygosity of the yellow drum genome via the GenomeScope software (Vurture et al., 2017). The estimated genome size of the yellow drum was between 573.2 Mb (17-mers) and 581.0 Mb (29-mers) (Supporting Information Table S2). The results were consistent with the estimated genome size (595.7 Mb) using flow cytometry (Cao, Zheng, Wang, Liu, & Cai, 2015). The 29-mer analysis indicates that the yellow drum genome possesses a low level of repeat content (45.6 Mb, 7.9%). Figure 2, all the paired-end reads were first assembled into contigs with N 50 of 5.2 kb. Then, scaffolding was conducted using the mate-pair reads to link contigs into 75,113 scaffolds. And the contig N 50 reached 14.4 kb, and the scaffold N 50 was 2.1 Mb.

As shown in
Finally, the gap closing increased the contig N 50 to 50.3 kb, and the scaffold (>2 kb) N 50 was 2.2 Mb ( The assessment analyses again suggested that the genome assembly of the yellow drum was sufficiently accurate and near complete.

| Functional annotation of predicted gene
Based on integrated methods of ab initio prediction, protein-based homology and transcript evidence, we obtained a final gene set con-  The predicted genes and annotation rates were comparable to same family species such as large yellow croaker (Ao et al., 2015) (25,401 annotated genes), medaka (Kasahara et al., 2007) (20,141 non-redundant genes), and Atlantic cod (Star et al., 2011) (22,154 protein-coding genes).

| Nucleotide diversity of aquacultured yellow drum
Yellow drum aquaculture has existed for more than 20 years in east China. To get an idea of the population diversity of the aquacultured yellow drum, we measured the nucleotide diversity of four individuals randomly sampled from the aquaculture populations. As a result, the average nucleotide diversity of the four individuals was 0.26% (~1/390; range: 0.23%-0.27%), which was comparable to or even higher than that of wild populations of the previously reported marine fish (1/309 in herring (Barrio et al., 2016), 1/435 in coelacanth (Amemiya et al., 2013), 1/500 in cod (Star et al., 2011), and 1/700 in stickleback (Jones et al., 2012)). This indicates that the current aquaculture population of the yellow drum could be still genetically diverse. During 2004-2005, the nucleotide diversity of the wild-caught yellow drum was estimated using 33 polymorphic SNPs from three locations along the coastal regions of the Yellow Sea and the East China Sea, and it ranged from 0.30% to 1.16% (Qingdao), from 0.33% to 1.27% (Zhoushan), and from 0.42% to 1.56% (Xiamen), respectively (Han, Gao, Yanagimoto, & Sakurai, 2008). The nucleotide diversity in wild population is much higher than that found in this study, suggesting the aquaculture population might have suffered from inbreeding.

| Phylogenetic analysis and gene family
To determine the evolutionary position of N. albiflora, we performed systematic genome comparisons among N. albiflora and 11 other vertebrates. The phylogenetic tree was finally constructed based on 1,070 pairs of one-to-one protein-coding orthologs using the MLE method. The yellow drum is evolutionarily close to the large yellow croaker, and the divergence time between them is about 19.7 MYA ( Figure 3). Among all other teleosts, the three-spined stickleback is a close sister group with the yellow drum and large yellow croaker (MYA: 64.8), which is consistent with the previous phylogenetic analysis of Sciaenidae and Gasterosteiformes (Ao et al., 2015). and is directly involved in many biological functions such as muscle contraction, cardiac regulation, cell movement, and signal transduction in animals (Fu & Zhang, 2008). Yellow drum is closer to large yellow croaker in phylogeny ( Figure 3). However, it has been noted that yellow drum has better swimming ability than the large yellow croaker in the mariculture practice, which could be partially attributed to the expansion of family of myosin genes. Besides, sodium: neurotransmitter symporter (SNF) family genes (such as SLC6A1, SLC6A6, SLC6A8, SLC6A11, SLC6A12, and SLC6A13) were also expanded in yellow drum in contrast to large yellow croaker. Those genes are essential for the release, re-uptaking, and recycling of neurotransmitters at synapses (Attwell & Bouvier, 1992). SLC6A1, SLC6A11, and SLC6A13, widely distributed in brain, encode sodium-dependent transporters that uptake gamma-aminobutyric acid (GABA) (Zhou et al., 2012). in the skeletal muscle, play key roles in optimal uptake and osmotic regulation (Borden, Smith, Gustafson, Branchek, & Weinshank, 2002;Sora et al., 1994). Those expanded genes might be also important for motility and osmotic regulation in the yellow drum. The detailed mechanism should be investigated in the future.

| CON CLUS IONS
Here, we report the first genome assembly of the yellow drum, which has been demonstrated to be highly accurate and near complete. The results showed that our strategy of using a homozygous individual (gynogen) and better assembly algorithm (e.g., Platanus) in de novo genome assembling was powerful with just Illumina short reads. The near-complete genome and its annotation allowed us to perform population genetics and evolutionary analyses in the yellow drum. These resources will also be necessary in conservation efforts and genetic breeding in this fish species. In addition, this study will illuminate undiscovered genetic characteristics of Nibea genus in the future studies on these species. Furthermore, the assembly is fragmented in contigs and is not ideally used for chromosome-scale assembling. Hence, the genome still needs to be improved in the future with advanced technologies (e.g., long-read sequencing and optical mapping). F I G U R E 3 Phylogenetic tree and orthologous genes in Nibea albiflora and 11 other vertebrates. Blue numbers in the phylogenetic tree indicate the divergence time (MYA, million years ago), and the green and red numbers represent the expanded and contracted gene families, respectively. The histogram shows different types of orthologous relationships. "1:1:1" means universal single-copy genes; "N:N:N" means orthologs exist in all genomes; "SS" means species-specific genes; and "Others" means orthologs that do not fit into the other categories

ACK N OWLED G M ENTS
Research & Key Laboratory of Sustainable Development of Marine Fisheries, Ministry of Agriculture, P.R. China, CAFS (NO. 2017HY-

XKQ01) and Key Projects of the Xiamen Southern Ocean Research
Center (14GZY70NF34).

CO N FLI C T O F I NTE R E S T S
The authors declare that they have no competing interests.