Finding Nemo’s Genes: A chromosome‐scale reference assembly of the genome of the orange clownfish Amphiprion percula

Abstract The iconic orange clownfish, Amphiprion percula, is a model organism for studying the ecology and evolution of reef fishes, including patterns of population connectivity, sex change, social organization, habitat selection and adaptation to climate change. Notably, the orange clownfish is the only reef fish for which a complete larval dispersal kernel has been established and was the first fish species for which it was demonstrated that antipredator responses of reef fishes could be impaired by ocean acidification. Despite its importance, molecular resources for this species remain scarce and until now it lacked a reference genome assembly. Here, we present a de novo chromosome‐scale assembly of the genome of the orange clownfish Amphiprion percula. We utilized single‐molecule real‐time sequencing technology from Pacific Biosciences to produce an initial polished assembly comprising 1,414 contigs, with a contig N50 length of 1.86 Mb. Using Hi‐C‐based chromatin contact maps, 98% of the genome assembly was placed into 24 chromosomes, resulting in a final assembly of 908.8 Mb in length with contig and scaffold N50s of 3.12 and 38.4 Mb, respectively. This makes it one of the most contiguous and complete fish genome assemblies currently available. The genome was annotated with 26,597 protein‐coding genes and contains 96% of the core set of conserved actinopterygian orthologs. The availability of this reference genome assembly as a community resource will further strengthen the role of the orange clownfish as a model species for research on the ecology and evolution of reef fishes.

The orange clownfish is one of 30 species of anemonefishes belonging to the subfamily Amphiprioninae within the family Pomacentridae (damselfishes). The two clownfishes, A. percula (orange clownfish or clown anemonefish) and A. ocellaris (false clownfish or western clown anemonefish), form a separate clade, alongside Premnas biaculeatus, within the Amphiprioninae (Li, Chen, Kang, & Liu, 2015; Litsios, Pearman, Lanterbecq, Tolou, & Salamin, 2014). The two species of clownfish are easily distinguished from other anemonefishes by their bright orange body coloration and three vertical white bars. The orange clownfish and the false clownfish have similar body coloration, but largely distinct allopatric geographical distributions. The orange clownfish occurs in northern Australia, including the Great Barrier Reef (GBR), and in Papua New Guinea, Solomon Islands and Vanuatu, while the false clownfish occurs in the Indo-Malaysian region, from the Ryukyu Islands of Japan, throughout South-East Asia and south to north-western Australia (but not the GBR).
Like all anemonefishes, the orange clownfish has a mutualistic relationship with sea anemones. Wild adults and juveniles live exclusively in association with a sea anemone, where they gain shelter from predators and benefit from food captured by the anemone (Fautin, 1991; Fautin & Allen, 1997; Mebs, 2009). In return, the sea anemone benefits by gaining protection from predators (Fautin & Allen, 1997; Holbrook & Schmitt, 2005), from supplemental nutrition from the clownfish's waste (Holbrook & Schmitt, 2005) and from increased gas exchange as a result of increased water flow provided by clownfish movement and activity (Herbert, Bröhl, Springer, & Kunzmann, 2017; Szczebak, Henry, Al-Horani, & Chadwick, 2013).
Clownfish social groups typically consist of an adult breeding pair and a variable number of smaller, size-ranked juveniles that queue for breeding rights (Buston, 2003). The breeding female is larger than the male. If the female disappears, the male changes sex to female and the largest nonbreeder matures into a breeding male. The breeding pair lays clutches of demersal eggs in close proximity to their host anemone. Eggs hatch after 7-8 days and the larvae disperse into the open ocean for a period of 11-12 days, at which time they return to the reef and settle to an anemone.
The close association of clownfish and other anemonefishes with sea anemones makes them excellent species for studying aspects of marine mutualisms and habitat selection. The easily identified and delineated habitat they occupy, along with the ease with which the fish can be observed in nature, makes them ideal candidates for behavioural and population ecology. The unique capacity to collect juveniles immediately after they have settled to the reef from their pelagic larval phase also makes them ideally suited to testing long-standing questions about larval dispersal and population connectivity in reef fish populations. Using molecular techniques to assign parentage between newly settled juveniles and adult anemonefishes, recent studies have been able to describe for the first time the spatial scales of dispersal in reef fish and its temporal consistency (Almany et al., 2017). The ability to map the connectivity of clownfish populations in space and time has also opened the door to addressing challenging questions about selection, fitness and adaptation in natural populations of marine fishes (Pinsky et al., 2017; Salles et al., 2016). Finally, the orange clownfish is one of the relatively few coral reef fishes that can easily be reared in captivity (Wittenrich, 2007). Consequently, it has unrivalled potential for experimental manipulation to test ecological and evolutionary questions in marine ecology (Dixson et al., 2014; Manassa et al., 2013), including the impacts of climate change and ocean acidification (Nilsson et al., 2012). Increasingly, genomewide methods are being used to test ecological and evolutionary questions and this is particularly true for coral reef species in the wake of anthropogenic climate change and its effects on these sensitive ecosystems (Stillman & Armstrong, 2015).
To date, genome assemblies of two anemonefishes, A. frenatus (Marcionetti, Rossier, Bertrand, Litsios, & Salamin, 2018) and A. ocellaris (Tan et al., 2018), have been published. Both of these were based on short-read Illumina technology with genome scaffolding provided by shallow coverage of PacBio (Marcionetti et al., 2018) or Oxford Nanopore (Tan et al., 2018) long reads. While the use of long reads to scaffold Illumina-based assemblies improves contiguity, both genome assemblies are highly fragmented with respective contig and scaffold N50s of 14.9 and 244.5 kb for A. frenatus and 323.6 and 401.7 kb for A. ocellaris. Here, we present a chromosome-scale genome assembly of the orange clownfish, which was assembled using a primary PacBio long read strategy, followed by scaffolding with Hi-C-based chromatin contact maps. The resulting final assembly is highly contiguous with contig and scaffold N50 values of 3.12 and 38.4 Mb, respectively. This assembly will be a valuable resource for the research community and will further establish the orange clownfish as a model organism for genetic and genomic studies into ecological, evolutionary and environmental aspects of reef fishes. To facilitate the use of this resource, we have developed an integrated database, the Nemo Genome DB (www.nemogenome.org), which allows for the interrogation and mining of genomic and transcriptomic data described here.

| Specimen collection and DNA extraction
Adult orange clownfish breeding pairs were collected on the northern GBR in Australia. Fish were bred at the Experimental Aquarium Facility of James Cook University (JCU) and one individual offspring was sacrificed at the age of 8 months. The whole brain was excised, snap frozen and kept at −80°C until processing. High molecular weight DNA was extracted from whole brain tissue using the Qiagen Genomic-tip 100/G extraction kit. The tissue was first homogenized in lysis buffer G2 supplemented with 200 μg/ml RNase A using sterile beads for 30 s. After homogenization, proteinase K was added and the homogenate was incubated at 50°C overnight. DNA extraction was then performed according to the manufacturer's protocol with a final elution volume of 200 μl. DNA fragment size and quality were assessed using pulsed-field gel electrophoresis. This study was completed under JCU animal ethics permits A1961 and A2255.

| PacBio library preparation and sequencing
For Pacific Biosciences (PacBio) long read sequencing, the extracted orange clownfish DNA was first sheared using a g-TUBE (Covaris, MA, USA) (target size of 20 kb) and then converted into SMRTbell template libraries according to the manufacturer's protocol (Pacific Biosciences, CA, USA). Size selection was performed using BluePippin (Sage Science, MA, USA) to generate two libraries with a minimum size of 10 and 15 kb, respectively. Sequencing was performed using P6-C4 chemistry on the PacBio RS II instrument at the King Abdullah University of Science and Technology (KAUST) Bioscience Core Laboratory (BCL) with 360 min movies. A total of 113 SMRT cells were sequenced.

| Mitochondrial genome assembly
The published A. percula mitochondrial genome sequence (NC_023966) was used as a reference to filter the available PacBio reads. Only reads that mapped to the reference using bwa mem version 0.7.10 (Li, 2013) with the PacBio default parameters were retained. This yielded 274 reads with a total length of 2,431,457 bp, an N50 of 12,026 bp, and a predicted coverage of 146X. The mitochondrial reads were then assembled using the Organelle_PBA (Soorni, Haak, Zaitlin, & Bombarely, 2017) pipeline. The resulting assembly was annotated for genes using MitoAnnotator (Iwasaki et al., 2013). To confirm the species of the sampled individual, a phylogeny based on the annotated Cytochrome c oxidase subunit I gene (COI), Cytochrome b (Cyt b) and 12S rRNA was constructed. The sequence data of 11 anemonefish species (A. akallopisos-NC_ (Stamatakis, 2006) with default parameters. Finally, a maximum-likelihood phylogenetic tree was derived from the concatenated multiple alignments using RAxML (Larkin et al., 2007) with the GTRGAMMA model and 500 rounds of bootstrapping (parameters: -m GTRGAMMA -f a -N 500).

| Genome assembly
The genome sequence was assembled from the unprocessed PacBio reads (Table S1) using the hierarchical diploid-aware PacBio assembler FALCON version 0.4.0 (Chin et al., 2016). To obtain the optimal assembly, different parameters were tested (Table S2) to generate 12 candidate assemblies. The contiguity of these assemblies was assessed with QUAST version 3.2 (Gurevich, Saveliev, Vyahhi, & Tesler, 2013), while assembly completeness was determined with BUSCO version 2.0 (Simão, Waterhouse, Ioannidis, Kriventseva, & Zdobnov, 2015). Assembly "A7" exhibited the highest contiguity and single-copy orthologous gene completeness and was selected for further improvement. The FALCON_Unzip algorithm was then applied to the initial A7 assembly to obtain a haplotype-resolved, phased assembly, termed "A7-phased." Contigs less than 20 kb in length were removed from the assembly. This phased assembly was polished with Quiver (Chin et al., 2013), using default settings, to achieve final consensus sequence accuracies comparable to Sanger sequencing, which produced the "A7-phased-polished" assembly.
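The contiguity metrics used to rank the candidate assemblies can be illustrated with a short calculation. The following is a minimal sketch of how contig N50 and L50 are conventionally defined, not the QUAST implementation; the function name and the toy contig lengths are illustrative assumptions.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig lengths.

    N50 is the length of the contig at which the running total of
    length-sorted contigs first reaches half of the assembly size;
    L50 is the number of contigs needed to reach that point.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise ValueError("no contig lengths given")


# Toy example (hypothetical lengths, arbitrary units).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50_l50(toy_contigs))  # (70, 2)
```

A higher N50 together with a lower L50 indicates that more of the assembly is contained in fewer, longer contigs.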

| Genome assembly scaffolding with chromatin contact maps
The flash-frozen brain tissue was sent to Phase Genomics (Seattle, WA, USA) for the construction of chromatin contact maps. Tissue fixation, chromatin isolation, library preparation and 80-bp paired-end sequencing were performed by Phase Genomics. The sequencing reads were aligned to the A7-phased-polished version of the assembly with BWA (Li & Durbin, 2010) and uniquely mapping read pairs were retained. Contigs from the A7-phased-polished assembly were clustered, ordered and then oriented using Proximo (Bickhart et al., 2017; Burton et al., 2013), with settings as previously described (Peichel, Sullivan, Liachko, & White, 2017). Briefly, contigs were clustered into chromosomal groups using a hierarchical clustering algorithm based on the number of read pairs linking scaffolds, with the final number of groups specified as the haploid chromosome number. The haploid chromosome number was set as 24, which is consistent with the observed haploid chromosome number of the Amphiprioninae, as published for A. ocellaris (Arai, Inoue, & Ida, 1976), A. frenatus (Molina & Galetti, 2004; Takai & Kosuga, 2007), A. clarkii (Takai & Kosuga, 2007), A. perideraion (Supiwong et al., 2015) and A. polymnus (Tanomtong et al., 2012). After clustering into chromosomal groups, the scaffolds were ordered based on Hi-C link densities and then oriented with respect to the adjacent scaffolds using a weighted directed acyclic graph of all possible orientations based on the exact locations of the Hi-C links between scaffolds. Gaps between contigs were represented with 100 Ns and the proximity-guided assembly was named "A7-
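The clustering step described above can be sketched as a simple agglomerative procedure. This is only an illustration under stated assumptions: it uses plain average linkage on raw read-pair counts, whereas Proximo's actual normalization and linkage rules are not specified here, and the function name, input layout and toy link counts are hypothetical.

```python
from itertools import combinations


def cluster_contigs(contigs, links, n_groups):
    """Greedy average-linkage clustering of contigs by Hi-C link counts.

    contigs  : iterable of contig names
    links    : dict mapping frozenset({a, b}) -> number of linking read pairs
    n_groups : target number of groups (e.g. the haploid chromosome number)
    """
    if n_groups < 1:
        raise ValueError("n_groups must be at least 1")
    clusters = [{name} for name in contigs]

    def average_links(x, y):
        # Mean number of read pairs linking a contig in x to a contig in y.
        total = sum(links.get(frozenset((a, b)), 0) for a in x for b in y)
        return total / (len(x) * len(y))

    while len(clusters) > n_groups:
        # Merge the pair of clusters with the strongest average linkage.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_links(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] |= clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Toy example: two obvious groups connected by a few spurious links.
toy_links = {
    frozenset(("ctgA", "ctgB")): 50,
    frozenset(("ctgC", "ctgD")): 40,
    frozenset(("ctgA", "ctgC")): 2,
    frozenset(("ctgB", "ctgD")): 1,
}
groups = cluster_contigs(["ctgA", "ctgB", "ctgC", "ctgD"], toy_links, n_groups=2)
print(sorted(sorted(g) for g in groups))
```

Fixing the number of groups in advance, as done here with `n_groups`, mirrors the use of the known haploid chromosome number as the stopping criterion.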

| Repeat annotation
A species-specific de novo repeat library was assembled by combining the results of three distinct repeat annotation methods. Firstly, RepeatModeler version 1.08 (Smit & Hubley, 2008) was used to build an initial repeat library. Secondly, we used LTRharvest (Ellinghaus, Kurtz, & Willhoeft, 2008) and LTRdigest (Steinbiss, Willhoeft, Gremme, & Kurtz, 2009), both accessed via genometools 1.5.6 (Gremme, Steinbiss, & Kurtz, 2013), with the following parameters: -minlenltr 100 -maxlenltr 6000 -maxdistltr 25000 -mindistltr 1500 -similar 90. The resulting hits were filtered with LTRdigest, accepting only sequences featuring a hit to one of the hidden Markov models in the GyDB 2.0 database. Thirdly, TransposonPSI version 08222010 (Haas, 2018) (Jurka, Klonowski, Dagman, & Pelton, 1996) and Dfam version 2.0 (Wheeler et al., 2012), and were then blasted against the Uniprot/Swissprot database (release 2017_12) to obtain a unified classification. Furthermore, these three classification methods and the blast result were used to filter out spurious matches to protein-coding sequence. Specifically, putative repeat sequences were only retained when at least one classification method recognized the sequence as a repeat and the best match in Swissprot/Uniprot was not a protein-coding gene (default blastx settings). Furthermore, sequences were retained if two of the three identification methods classified the sequence as repeat, but the best blast hit was not a transposable element. This de novo library was combined with the thoroughly curated zebrafish repeat library provided by Repbase version 22.05 (Bao, Kojima, & Kohany, 2015) and this combined library was employed for repeat masking in the Nemo version 1 assembly using RepeatMasker (Smit, Hubley, & Green, 2010
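The two retention rules for putative repeat sequences can be written out as a small decision function. This sketch encodes one reading of the rules stated above and is not the authors' script; the function name and the three labels used for the best blast hit are hypothetical.

```python
VALID_HITS = {"none", "transposable_element", "other_protein"}


def keep_putative_repeat(n_methods_calling_repeat, best_hit):
    """Decide whether a putative repeat sequence is retained.

    n_methods_calling_repeat : how many of the three classification
                               methods recognised the sequence as a repeat
    best_hit                 : hypothetical label for the best blastx hit:
                               'none', 'transposable_element' or
                               'other_protein' (a non-repeat protein-coding gene)
    """
    if best_hit not in VALID_HITS:
        raise ValueError(f"unknown best_hit label: {best_hit!r}")
    # Rule 2 (assumed reading): agreement of two or more methods rescues a
    # sequence even when its best hit is not a transposable element.
    if n_methods_calling_repeat >= 2:
        return True
    # Rule 1: a single method suffices only if the best hit is not a
    # non-repeat protein-coding gene.
    return n_methods_calling_repeat >= 1 and best_hit != "other_protein"


for n, hit in [(1, "none"), (1, "other_protein"), (2, "other_protein")]:
    print(n, hit, keep_putative_repeat(n, hit))
```

Under this reading, a single supporting method is enough unless the sequence looks like an ordinary protein-coding gene, in which case a second method must agree.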

| Functional annotation
The protein sequences produced from the genome assembly annota-

| Genome assembly comparisons
For genome assembly comparisons, we compared the Nemo version 1 genome assembly to the 26 previously reported fish chromosome-scale genome assemblies (Table S3). Comparisons were made for genome assembly contiguity and completeness. Contig N50 values are reported for the scaffold-scale versions of each assembly and are taken from the indicated publication (Table S3), database description (Table S3) or were generated with the Perl assemblathon_stats_2.pl script (Bradnam et al., 2013). Genome assembly completeness was assessed by determining the proportion of the genome size that is contained within the chromosome content of each assembly. It should be noted that this comparison is relative to the estimated genome size and not the published assembly size. The estimated genome size was taken as either the published estimated genome size in the relevant paper (Table S3) or from the Animal Genome Size Database (Gregory, 2018). Where possible, k-mer-derived or flow cytometry-based estimates of genome size were used. Before calculation, we removed stretches of Ns from the genome assemblies as these are used to arbitrarily space scaffolds and do not contain actual genome information. However, this step was not possible for the Asian arowana, southern platyfish, yellowtail or croaker genomes as the chromosome-scale assemblies have not been made publicly available. Genome assembly completeness was determined with BUSCO (Simão et al., 2015) using the Actinopterygii set of 4,584 genes and the AUGUSTUS zebrafish gene model provided with the software.
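The chromosome-content measure described above can be illustrated as follows. This is a minimal sketch that counts every non-N base in the chromosome sequences and divides by the estimated genome size; the function name and the toy sequences are assumptions, and counting individual N characters is used here as a stand-in for removing stretches of Ns.

```python
def chromosome_completeness(chromosome_seqs, estimated_genome_size_bp):
    """Fraction of the estimated genome size present in chromosomes.

    chromosome_seqs          : iterable of chromosome sequences (strings)
    estimated_genome_size_bp : independent genome size estimate in bp
    """
    if estimated_genome_size_bp <= 0:
        raise ValueError("estimated genome size must be positive")
    # Ns only space contigs apart, so they are excluded from the count.
    placed_bases = sum(len(seq) - seq.upper().count("N") for seq in chromosome_seqs)
    return placed_bases / estimated_genome_size_bp


# Toy example: 12 informative bases against an estimated size of 20 bp.
toy_chromosomes = ["ACGTNNNNACGT", "nnACGT"]
print(chromosome_completeness(toy_chromosomes, 20))  # 0.6
```

Because the denominator is the estimated genome size rather than the assembly size, an assembly cannot score well simply by leaving difficult sequence out.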

| Gene homology
To investigate the gene space of the orange clownfish genome assembly, we used OrthoFinder version 1.1.4 (Emms & Kelly, 2015) to identify orthologous gene relationships between the orange clownfish and four related fish species.

| Database system architecture and software
The Nemo Genome DB database (www.nemogenome.org) was implemented on a UNIX server with CentOS version 7, Apache web server and MySQL Database server. JBrowse (Buels et al., 2016) was employed to visualize the genome assembly and genomic features graphically and interactively. JavaScript was adopted to implement client-side rich applications. The JavaScript library, jQuery (https://jquery.com), was employed. Other conventional utilities for UNIX computing were appropriately installed on the server if necessary.
All of the Nemo Genome DB resources are stored on the server and are available through HTTP access.

| Sequencing and assembly of the orange clownfish genome
Genomic DNA of an individual orange clownfish (Figure 1a) was sequenced with the PacBio RS II platform to generate 1,995,360 long reads, yielding 113.8 Gb, which corresponds to a 121-fold coverage of the genome (Table S1). After filtering with the read preassembly step of the Falcon assembler, 5,764,748 reads, covering 54.3 Gb and representing a 58-fold coverage of the genome, were available for assembly.
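The fold-coverage values quoted above follow from dividing the sequenced bases by the genome size. The sketch below assumes that the 938.9 Mb genome size estimate reported elsewhere in this paper is the denominator; the function name is illustrative.

```python
def fold_coverage(total_bases, genome_size_bp):
    """Sequencing depth as total sequenced bases over genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome size must be positive")
    return total_bases / genome_size_bp


GENOME_SIZE_BP = 938.9e6  # assumed denominator: estimated genome size

raw_depth = fold_coverage(113.8e9, GENOME_SIZE_BP)      # all PacBio reads
filtered_depth = fold_coverage(54.3e9, GENOME_SIZE_BP)  # after preassembly
print(round(raw_depth), round(filtered_depth))  # 121 58
```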
To optimize the assembly parameters, we performed 12 trial assemblies using a range of parameters for different stages of the Falcon assembler (Table S2). The assembly quality was assessed by considering assembly contiguity (contig N50 and L50), total assembly size and also gene completeness (BUSCO) ( However, diploid-aware assembly algorithms such as the Falcon_Unzip assembler are designed to detect single-nucleotide polymorphisms (SNPs) as well as structural variations and to use this information to phase ("unzip") heterozygous regions into distinct haplotypes (Chin et al., 2016). This procedure results in a primary assembly and a set of associated haplotype contigs (haplotigs) capturing the divergent sequences. Having established the parameter set that gave the best assembly metrics with Falcon, we used Falcon_Unzip to produce a phased assembly ("A7-phased") of the orange clownfish (Table 2). The phased assembly was 905.0 Mb in length with a contig N50 of 1.85 Mb. As has been seen in previous genome assembly projects (Chin et al., 2016), Falcon_Unzip produced a smaller assembly with fewer contigs than the assembly produced by Falcon (Table 2). The phased primary assembly was then polished with Quiver, which yielded an assembly ("A7-phased-polished") with 1,414 contigs spanning 903.6 Mb with an N50 of 1.86 Mb (Table 2). This polishing step closed 91 gaps in the assembly and improved the N50 by approximately 14.3 kb. After polishing of the "unzipped" A7-phased-polished assembly, 9,971 secondary contigs were resolved, covering 340.1 Mb of the genome assembly.
The contig N50 of these secondary contigs was 38.2 kb, with over 99% of them being longer than 10 kb in size. Relative to the 903.6 Mb A7-phased-polished primary contig assembly, the secondary contigs covered 38% of the assembly size. To the best of our knowledge, this is the first published fish genome assembly that has been resolved to the haplotype level with Falcon_Unzip.

| Scaffolding of the orange clownfish genome assembly into chromosomes
To build a chromosome-scale reference genome assembly of the orange clownfish, chromatin contact maps were generated by Phase Genomics (Supporting Information Figure S2). Scaffolding was performed by the Proximo algorithm (Bickhart et al., 2017; Burton et al., 2013) on the A7-phased-polished assembly using 231 million Hi-C-based paired-end reads to produce the proximity-guided assembly "A7-PGA" (Table 2). The contig clustering allowed the placement of 1,073 contigs into 24 scaffolds (chromosomes) with lengths ranging from 23.4 to 45.8 Mb (Tables 2 and 3). While only 76% of the contigs were assembled into chromosome clusters, this corresponds to 98% (885.4 Mb) of total assembly length and represents 95% of the estimated genome size of 938.9 Mb (Tables 2 and 3). This step substantially improved the overall assembly contiguity, raising the N50 20-fold from 1.86 to 38.1 Mb.
TABLE 2 Assembly statistics of the orange clownfish genome assemblies
A quality score for the order and orientation of contigs within the A7-PGA assembly was determined. This metric is based on the differential log-likelihood of the contig orientation having produced the observed log-likelihood, relative to its neighbours (Burton et al., 2013). The orientation of a contig was deemed to be of high quality if its placement and orientation, relative to neighbours, were 100 times more likely than alternatives (Burton et al., 2013). In A7-PGA, the placements of 524 (37%) of the scaffolds were deemed to be of high quality, accounting for 775.5 Mb (87%) of the scaffolded chromosomes, indicating the robustness of the assembly.
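The "100 times more likely" criterion can be expressed as a threshold on a log-likelihood difference. The sketch below assumes natural-log likelihoods and a single competing alternative; the function name and the example values are hypothetical, and the exact scoring used by the scaffolding software may differ.

```python
import math


def is_high_quality(loglik_chosen, loglik_alternative, fold=100.0):
    """True if the chosen placement is at least `fold` times more likely.

    Both arguments are natural-log likelihoods, so a likelihood ratio of
    `fold` corresponds to a difference of ln(fold) (about 4.6 for 100).
    """
    return (loglik_chosen - loglik_alternative) >= math.log(fold)


print(is_high_quality(-10.0, -15.0))  # True: difference of 5.0 exceeds ln(100)
print(is_high_quality(-10.0, -12.0))  # False: difference of 2.0 is too small
```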
A final polishing step was performed with PBJelly to generate the final Nemo v1 assembly. This polishing step closed 369 gaps, thereby improving the contig N50 by 68% and increasing the total assembly length by 5.21 Mb (Tables 2 and 3). The length of each chromosome was increased, with a range of 23.7 to 46.1 Mb (Figure 1b). Gaps were closed in each chromosome except for chromosome 14, leaving an average of only 28 gaps per chromosome (

| Validation of the orange clownfish genome assembly size
The final assembly size of 908.9 Mb is consistent with the results of a Feulgen image analysis densitometry-based study, which determined a C-value of 0.96 pg and thus a genome size of 938.9 Mb for the orange clownfish (Hardie & Hebert, 2004). Furthermore, our assembly size is in keeping with estimates of genome size for other fish of the Amphiprion genus, which range from 792 to 1,193 Mb (Gregory, 2018). We additionally validated the observed assembly size by using a k-mer-based approach. Specifically, the k-mer coverage and frequency distribution were plotted and fitted with a four-component statistical model with GenomeScope (Supporting Information Figure S3a). This allowed us to generate an estimate of genome size as well as the repeat content and level of heterozygosity. However, varying the k-value from   Figure S3b). While the short-read k-mer-based genome size estimate of 906.6 Mb matches the final assembly size of 908.9 Mb very well, the C-value-derived genome size estimate is slightly larger (938.9 Mb). As an additional validation of the accuracy of the genome assembly, we mapped the trimmed Illumina short reads to the Nemo version 1 assembly and observed that 95% of the reads mapped to the assembly and that 84% of the reads were properly paired.
Based on the C-value-derived genome size estimate, there is approximately 29.9 Mb (3.3%) of sequence length absent from our genome assembly. It seems likely that our assembly is nearly complete for the euchromatic regions of the genome given our assessment of genome size and gene content completeness. However, genomic regions such as the proximal and distal boundaries of euchromatic regions, which contain heterochromatic and telomeric repeats, respectively, are refractory to currently available sequencing techniques and are typically absent from genome assemblies (Bickhart et al., 2017; Hoskins et al., 2007).
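The C-value-derived genome size can be reproduced with the widely used conversion of 1 pg of DNA to roughly 978 Mb. That this factor underlies the 938.9 Mb figure is an assumption of this sketch, although it reproduces the reported value; the names below are illustrative.

```python
MB_PER_PG = 978.0  # assumed conversion factor: 1 pg of DNA ~ 978 Mb


def c_value_to_mb(c_value_pg):
    """Convert a haploid DNA content (C-value, in pg) to megabases."""
    if c_value_pg < 0:
        raise ValueError("C-value cannot be negative")
    return c_value_pg * MB_PER_PG


genome_size_mb = c_value_to_mb(0.96)
print(round(genome_size_mb, 1))  # 938.9
```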

| Phylogenetic analysis of mitochondrial genes
The mitochondrial genome of A. percula was assembled using Organelle_PBA (Soorni et al., 2017) and mitochondrial genes were annotated using MitoAnnotator (Iwasaki et al., 2013) (Supporting Information Figure S1a). The consensus length of the mitochondrial genome is 16,638 bp, which is only 7 bp shorter than the reference sequence NC_023966. It contains 13 protein-coding genes, 22 transfer RNA genes, one 12S and one 16S ribosomal RNA gene, and one D-loop control region. The sequence similarity of the complete mitogenomes between A. percula and A. ocellaris (NC_009065) is 95.5%, which is consistent with previous reports (Tao, Li, Liu, & Hu, 2016). The phylogenetic analysis of the Cytochrome c oxidase subunit I (COI), Cytochrome b (Cyt b) and 12S rRNA genes from 11 anemonefish species and the Indo-Pacific sergeant revealed that the sequenced individual is most likely A. percula (Supporting Information Figure S1b).

| Chromosome-scale fish genome assembly comparisons
To date, chromosome-scale genome assemblies have been released for 26 other fish species (Supporting Information Table S3). Here, we present the first chromosome-scale assembly of a tropical coral reef fish, the orange clownfish. As a measure of genome assembly quality, we assessed the contiguity and completeness of these 27 chromosome-scale genome assemblies. We investigated genome contiguity with the contig N50 metric and characterized genome completeness for each genome assembly by calculating the proportion of the estimated genome size that was assigned to chromosomes. As shown in Figure 1c, the orange clownfish genome assembly is highly contiguous, with a scaffold-scale contig N50 of 1.86 Mb, which is only surpassed by the contig N50 of the Nile tilapia genome assembly. Interestingly, even though different assembler algorithms were utilized, the three genome assemblies based primar-
FIGURE 4 The total number of orthogroups (nOG) followed by the number of genes assigned to these groups is provided below the species name. The number of species-specific orthogroups (nSOG) and the respective number of genes is also indicated, followed by the number of genes not assigned to any orthogroups. (b) The inferred phylogenetic tree based on the ortholog groups that contain a single gene from each species. Drawings of the fish species were obtained from Wikimedia Commons
suggests that the Nemo v1 assembly is one of the most complete fish genome assemblies published to date (Figure 1c). Only the zebrafish (94%) and Atlantic cod (91%) genome assemblies had a comparably high proportion of their estimated genome sizes scaffolded into chromosome-length scaffolds (Figure 1c). It is likely that the use of both PacBio long reads and Hi-C-based chromatin contact maps contributed to the very high proportion of the orange clownfish genome that we were able to both sequence and assemble into chromosomes.
While assembly contiguity is important, genome completeness with respect to gene content is also vital for producing a genome assembly that will be utilized by the research community. We evalu-

| Anemonefish genome assembly comparisons
Genome assemblies for A. frenatus (Marcionetti et al., 2018) and A. ocellaris (Tan et al., 2018) have been previously reported. While the A. percula genome assembly reported here is based on a PacBio primary assembly, the A. frenatus and A. ocellaris assemblies are based on Illumina short-read technology, with scaffolding provided by a shallow coverage of long reads. The use of a primary PacBio assembly strategy facilitated the production of an assembly that is substantially more contiguous than the previously reported anemonefish genome assemblies (Supporting Information Table S4).

| Genome annotation
To annotate repetitive sequences and transposable elements, we constructed an orange clownfish-specific library by combining the results of RepeatModeler, LTRharvest and TransposonPSI. Duplicate sequences were removed and false positives were identified using three classification protocols (Censor, Dfam, RepeatClassifier) as well as comparisons to Uniprot/Swissprot databases. After these filtering steps, we identified 21,644 repetitive sequences. These sequences, in combination with the zebrafish library of Repbase, were then used for genome masking with RepeatMasker. This led to a total of 28% of the assembly being identified as repetitive (Figure 3a and Supporting Information Table S5). We observed a general trend of increased repeat density towards the ends of chromosome arms (Figure 3b and Supporting Information Figure S4). The total fraction of repetitive genomic sequence is in good agreement with other related fish species (Chalopin, Naville, Plard, Galiana, & Volff, 2015). Similarly, the high fraction of DNA transposons (~10%) is in line with DNA transposon content in other fish species (Chalopin et al., 2015) but is substantially higher than what has been reported in mammals (~3%) (Chalopin et al., 2015; Lander et al., 2001).
Following the characterization of repetitive sequences in the   (Table 3).
The spatial distribution of genes across all 24 chromosomes is relatively even (Figure 1b) (Emms & Kelly, 2015) to identify orthologous relationships between the protein sequences of the orange clownfish and four other fish species (Asian seabass, Nile tilapia, southern platyfish and zebrafish) from across the teleost phylogenetic tree (Betancur et al., 2013). The vast majority of sequences (89%) could be assigned to one of 19,838 orthogroups, with the remainder identified as "singlets" with no clear orthologs. We observed a high degree of overlap of protein sequence sets between all five species, with 75% of all orthogroups (14,783) shared amongst all species (Figure 4a). The proteins within these orthogroups presumably correspond to the core set of teleost genes. Of the 14,783 orthogroups with at least one sequence from each species, a subset of 8,905 orthogroups contained only a single sequence from each species. The phylogeny obtained from these single-copy orthologous gene sequences (Figure 4b) is consistent with the known phylogenetic tree of teleost fishes (Betancur et al., 2013).
Interestingly, we identified a total of 4,429 sequences that are specific to the orange clownfish, 2,293 (49%) of which possess functional annotations (Figure 4a). Future investigations will focus on the characterization of these unique genes and what roles they may play in orange clownfish phenotypic traits.
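The orthogroup categories discussed above (shared by all species, and single-copy in all species) can be derived from a simple membership table. The sketch below is not OrthoFinder output parsing; the data layout, function name, gene identifiers and toy species list are hypothetical.

```python
def summarise_orthogroups(orthogroups, species):
    """Classify orthogroups by their species membership.

    orthogroups : dict mapping orthogroup id -> {species: [gene ids]}
    species     : list of all species included in the comparison

    Returns (shared, single_copy): orthogroups with at least one gene from
    every species, and the subset with exactly one gene from every species.
    """
    shared = [
        og for og, members in orthogroups.items()
        if all(members.get(sp) for sp in species)
    ]
    single_copy = [
        og for og in shared
        if all(len(orthogroups[og][sp]) == 1 for sp in species)
    ]
    return shared, single_copy


# Toy example with three species and hypothetical gene identifiers.
toy_species = ["clownfish", "tilapia", "zebrafish"]
toy_orthogroups = {
    "OG1": {"clownfish": ["c1"], "tilapia": ["t1"], "zebrafish": ["z1"]},
    "OG2": {"clownfish": ["c2", "c3"], "tilapia": ["t2"], "zebrafish": ["z2"]},
    "OG3": {"clownfish": ["c4"], "tilapia": ["t3"]},
}
shared, single_copy = summarise_orthogroups(toy_orthogroups, toy_species)
print(shared, single_copy)  # ['OG1', 'OG2'] ['OG1']
```

Only the single-copy subset is suitable for building a species tree directly, since each such orthogroup contributes exactly one sequence per species.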

| CONCLUSION
Here, we present a reference-quality genome assembly of the iconic orange clownfish, A. percula. We sequenced the genome to a depth of 121X with PacBio long reads and performed a primary assembly with these reads utilizing the Falcon_Unzip algorithm. The primary assembly was polished to yield an initial assembly of 903.6 Mb with a contig N50 value of 1.86 Mb. These contigs were then assembled into chromosome-sized scaffolds using Hi-C chromatin contact maps, followed by gap-filling with the PacBio reads, to produce the final reference assembly, Nemo version 1. The Nemo version 1 assembly is highly contiguous, with contig and scaffold N50s of 3.12 and 38.4 Mb, respectively. The use of Hi-C chromatin contact maps allowed us to scaffold 890.2 Mb (98%) of the 908.2 Mb final assembly into the 24 chromosomes of the orange clownfish. An analysis of the core set of Actinopterygii genes suggests that our assembly is nearly complete, containing 97% of the core set of highly conserved genes. The Nemo version 1 assembly was annotated with 26,597 genes with an average AED score of 0.12, suggesting that most gene models are highly supported.
The high-quality Nemo version 1 reference genome assembly described here will facilitate the use of this now genome-enabled model species to investigate ecological, environmental and evolutionary aspects of reef fishes. To assist the research community, we have created the Nemo Genome DB database, www.nemogenome.org (Figure 5), where researchers can access, mine and visualize the genomic and transcriptomic resources of the orange clownfish.

This study was supported by the Competitive Research
Bartossek for the PacBio library preparations. We thank Dr. Salim Bougouffa for stimulating discussions. We also acknowledge Mr. Tane Sinclair-Taylor for providing the photograph of the orange clownfish (Figure 1a). This paper is dedicated to our good friend and colleague, Dr. Sylvain Foret.

DATA ACCESSIBILITY
The assembled and annotated genome as well as the raw PacBio reads and Illumina reads are available at the Nemo Genome DB (https://nemogenome.org). Furthermore, the nuclear and mitochondrial genome assemblies are available on GenBank as BioProject PRJNA436093 and BioSample accession SAMN08615572. Raw sequencing data described in this study are available via the NCBI Sequence Read Archive (SRP134923).