New genome assembly of the barn owl (Tyto alba alba)

Abstract New genomic tools open doors to study ecology, evolution, and population genomics of wild animals. For the Barn owl species complex, a cosmopolitan nocturnal raptor, a very fragmented draft genome was assembled for the American species (Tyto furcata pratincola) (Jarvis et al. 2014). To improve the genome, we assembled de novo Illumina and Pacific Biosciences (PacBio) long reads sequences of its European counterpart (Tyto alba alba). This genome assembly of 1.219 Gbp comprises 21,509 scaffolds and results in a N50 of 4,615,526 bp. BUSCO (Universal Single‐Copy Orthologs) analysis revealed an assembly completeness of 94.8% with only 1.8% of the genes missing out of 4,915 avian orthologs searched, a proportion similar to that found in the genomes of the zebra finch (Taeniopygia guttata) or the collared flycatcher (Ficedula albicollis). By mapping the reads of the female American barn owl to the male European barn owl reads, we detected several structural variants and identified 70 Mbp of the Z chromosome. The barn owl scaffolds were further mapped to the chromosomes of the zebra finch. In addition, the completeness of the European barn owl genome is demonstrated with 94 of 128 proteins missing in the chicken genome retrieved in the European barn owl transcripts. This improved genome will help future barn owl population genomic investigations.

27× coverage. This assembly consists of 62,122 scaffolds with a N50 of 52,818 bp (Table 2) and was used to resolve the barn owl's position in the bird tree of life (Jarvis et al., 2014;Prum et al., 2015) and to search for genes associated with low-light vision (Hanna et al., 2017;Hoglund et al., 2019;Le Duc et al., 2015;Wu et al., 2016). However, the use of draft genome entails problems such as noncontiguous assembly and missing genes, especially in GC-rich portions of bird genomes (Peona, Weissensteiner, & Suh, 2018). As shown by Warren et al. (2017), adding long reads such as those obtained from single-molecule real-time (SMRT, Pacific Biosciences, thereafter called PacBio) improves genome completeness and does not suffer from PCR amplification bias for the sequencing at GC or AT genome-rich region.
Here, we report a study where we sequenced, assembled, and annotated the genome of a male barn owl (T. alba alba) from Switzerland by combining Illumina and PacBio sequencing. We estimated the assembly quality using several metrics and methods, and in particular, we looked for chromosomal synteny with the American barn owl and the zebra finch and for avian "thought lost" genes found in GC-rich regions. We also examined where the barn owl is positioned in the avian phylogeny by using annotations derived from the American and European barn owls.

| Genome sequencing and assembly
Illumina and PacBio libraries generated a total of 158 Gbp highquality sequences (Table 1) and were assembled into 1.219 Gbp. The final assembly contained 21,509 scaffolds of more than 500 bp with a low proportion of undetermined nucleotides (Ns = 0.79%) and a N50 of 4.6 Mbp ( Table 2). The heterozygosity was estimated to be 0.373% with kmer plot ( Figure S3).
The assembly metrics of the European barn owl genome are compared with those of the chicken (Gallus_gallus-5.0), zebra finch (Taeniopygia_gutattata-3.2.4), collared flycatcher (FicAlb1.5), and American barn owl (ASM68720v1) described in Table S1. The expected maximal assembly size calculated from C-values ranges from 1.50 to 1.69 Gbp (De Vita, Cavallo, Eleuteri, & Dell'Omo, 1994;Venturini, D'Ambrogi, & Capanna, 1986). The NG50 scaffold length is 2.7 Mbp, a value 23 and 30 times lower than the NG50 of the zebra finch and of the chicken genomes, respectively, but 91 times higher than the NG50 of the American barn owl ( Table 2). The longest assembled scaffold (22,155,979 bp) in the European barn owl genome is 7 to 9 times smaller than in the zebra finch, collared flycatcher, and chicken genomes but 44 times larger than in the American barn owl genome ( Table 2). The 605 largest scaffolds in the European barn owl cover 95% of its genome assembly, less than the 1,000 scaffolds necessary to cover 95% of the genome assembly of the zebra finch and the chicken, and over 10,000 scaffolds necessary for the American barn owl ( Figure 2). The assembly comprises 5.21% of interspersed repetitive elements including SINEs, LINES, LTRs, and unclassified elements; of these the LINES and LTRs were the most abundant with 2.46% and 2.49%, respectively. Noninterspersed repeat elements such as small RNA, satellites, simple repeats, and low complexity represent 1.52% of the assembly. The total percent are 1.4, 1.6, 2.9 times lower than in the zebra finch, flycatcher, and chicken, respectively, and 1.2 times higher than the American barn owl (Table 3).

| Quality and completeness assessment
To further assess the quality of the genome of the European barn owl, the Illumina raw reads used to assemble the genome were mapped back to the assembly, resulting in an overall mapping rate above 96%. The completeness of the assembly is supported by searching for Universal Single-Copy Orthologs (BUSCO) (Simao, Waterhouse, Ioannidis, Kriventseva, & Zdobnov, 2015 Note: All stats were done for scaffolds >500 bp. To evaluate genome completeness, the European barn owl genome was compared with the other genomes using 4,915 conserved avian orthologous genes using BUSCO. Except for genome size (Gbp) all the data are given in bp. The genome size of the barn owl was the mean derived from the C-values of 1.73 and 1.53 pg of DNA that gave genomic size of 1.69 and 1.50 Gbp with an average of 1.59 Gbp (De Vita et al., 1994;Venturini et al., 1986).

| Annotation
The Augustus best model, trained with the chicken Uniprot reference proteome, predicted more than 38,000 proteins, twice as many as the available NCBI American barn owl annotation (Table 4).
To further evaluate the annotation completeness, the predicted proteins of the European and American barn owls were compared with a set of 30,252 chicken proteins. Global search (ggsearch) found 10,392 similar proteins in the European barn owl, but only 9,109 in the American barn owl (Table 4). In addition, the predicted protein sets of the European and American barn owl annotations were matched against 978 metazoan orthologues of the Universal Single-Copy Orthologs implemented in BUSCO (Simao et al., 2015).
Though the two annotation sets had the same proportion of reference proteins classified as "complete" (73.9%), the European barn owl annotation set had both fewer missing (13.1% vs. 15.6%) and duplicated (1.1% vs. 3.5%) proteins. The latter is remarkable considering the much larger set of annotated proteins in the European barn owl annotation set.

| GC content
As shown for primates and birds, distribution of the GC content varies between intra-and intergenic regions (Botero-Castro, Figuet, Tilak, Nabholz, & Galtier, 2017;Qi et al., 2016). In order to investigate the GC content in the European barn owl, we took advantage of 108,132 bp of 57 well-characterized Sanger sequenced genes (Accession numbers in Table S2). As for humans (Zhang, Kasif, Cantor, & Broude, 2004), the GC content of the genes of the European barn owl varied between and within genes, with 5'UTR being GC-richer than CDS and 3'UTR ( Figure 3a). We observed an average GC content of 51% in exons of the Sanger sequenced genes, which is high compared with the whole genome sequencing value (42%). We also

| Chromosomal synteny of the European barn owl, the American barn owl, and the zebra finch
To distinguish between true rearrangements and technical misassemblies between the two barn owl genomes, we mapped the raw Illumina reads used for the American and European barn owl assemblies to the assembly of the European barn owl (Figure 4a).
Similar coverage is observed for most reads between the European and American barn owls, except for few regions. For instance, scaffold 97 showed an increased coverage in the American barn owl compared with the European barn owl (Figure 4a). To verify that the detected variation in the coverage ratio was due to a physical rearrangement, the copy number of genomic DNA in this region was quantified for both barn owl species by real-time PCR (Figure 4b): We measured relative copy number of the putative duplicated/de- To assess the chromosomal structure of the European barn owl assembly, the scaffolds of the European and American barn owls were aligned to the zebra finch genome and visualized with Circos plots (Krzywinski et al., 2009). The zebra finch genome is one of the closest related species assembled at the chromosomal level (out of 40 haploid zebra finch chromosomes, 2n = 80, 37 have been assembled at the chromosome level, Figure S1; the haploid chromosome number in the barn owl is 46 (Rebholz & Northrop, 1993). A first comparison of all the chromosomes and scaffolds of the Ensembl zebra finch karyotype to the European barn owl scaffolds shows few rearrangements or mis-assemblies ( Figure 5a and Figure S2). These results suggest the overall structure of the European barn owl genome is comparable to the zebra finch genome and confirm the quality of the European barn owl genome assembly.

| Identification of contigs belonging to the sexual chromosomes
Since the American barn owl assembly was based on a heterogametic female (ZW) and the European barn owl assembly on a ho- except for the small scaffold 1931 that appeared to belong to an autosome ( Figure 5b). The Z chromosome had the highest number of rearrangements or miss-assemblies of all chromosomes ( Figure 5b and Figure S2).  (Table S3b). This again suggests that the European barn owl genome is quite complete by bird standards.

| Barn owl position in the tree of life
Previous studies found the position of the barn owl in the Avian phylogeny to be inconsistent when different regions of the genome were considered. In one study (Jarvis et al., 2014), the owl branch was placed either with the raptors (Accipitridae and vulture) or with the Coraciimorphae birds, such as mousebirds, cuckoo roller, and trogons using 48 sequenced genomes; in another study, where 122 avian species but only 259 targeted genes (Prum et al., 2015) were used, the barn owl fall within a new clade, the Inopinaves, a sister group of the Coraciimorphae, mousebird, cuckoo roller, trogons, and falcons but not a sister group to the hawks, which form a separated clade, the Eutelluraves. Since genome quality and completeness could impact phylogenetic analyses, we investigated whether the improved European barn owl genome altered the position of this species in the Avian phylogeny. For this, we generated five datasets: (i, ii) two datasets with either the American or the European barn owl as sole owl species; (iii, iv) two datasets with three other owl species in addition to either the American or European barn owl; and (v) one dataset with both American and European barn owls in the same tree.
For the trees with the single owl representative (i, ii), we obtained largely congruent trees, but with some conflicting branches  Prum et al. (2015) where the barn owl is a sister group of the Coraciimorphae (which include the mousebirds, the cuckoo roller, the trogons) and the falcons but not a sister group to the hawks, which form a separate clade.
The inclusion of three additional owl species (iii, iv) resulted in trees that had the same topology as with either American or European barn owl alone, albeit with lower support values for the conflicting branches ( Figure S4). Importantly, all owls were grouped together in both trees with a high level of support.
Likewise, when combining the European and American barn owls into a single dataset (v), we obtained a tree grouping the two barn owls with high support, however, splitting Leptosmus discolor from the Coraciimorphae clade with low support ( Figure S5).
Recent studies have suggested that this may be due to sub-  Figure S6).
Collectively, these analyses indicate that placement of owls in the Avian tree of life remains uncertain. This is due to the very high level of discordance among individual loci-likely due to a combination of hard-to-resolve, short internal branches, and Incomplete Lineage Sorting. Furthermore, because of the uncertainty in the true placement of owls, we are unable to conclusively assess the impact of genome quality on the inferred owl phylogeny.

| D ISCUSS I ON
The  (Table 2). In addition, the identification of the scaffolds belonging to the Z chromosome and their mapping to the zebra finch indicates that the assembled genome has few miss-assem-  birds, indicating that the genome is of relative high quality and completeness.
We identified 38,000 protein-coding genes in the genome of the European barn owl. In the chicken genome version 5, Warren identified 19,119 protein-coding genes and 6,839 noncoding genes (Warren et al., 2017) and Kuo 60,000 different transcripts using long-read RNA sequencing (Kuo et al., 2017) (Tiley, Kimball, Braun, & Burleigh, 2018

| CON CLUS IONS
Although there is room for improvement when compared to the genomes of model species, the current assembly represents a significant advance from previous genomic data in barn owls. Next step would be to complete our genome to get continuous chromosomes and to characterize the missing parts. As it stands, the European barn owl genome will be an extremely valuable resource for carrying further analyses, both at the structural, functional, and evolutionary level.

| RNA extraction and transcriptome
Six tissues (liver, heart, kidney, testis, growing back feathers, and thalamus) were sampled from a 57-day-old nestling barn owl

| Genome assembly
The five different short-read libraries sequenced on an Illumina HiSeq platform resulted in 605 million paired-end reads and in 106 million mate-pair reads of size 2 × 100 bp. In total, 143 Gbp were sequenced (Table 1). Multiple genome characteristics (genome size, heterozygosity) were estimated using a kmer counting approach (Jellyfish (http://www.cbcb.umd.edu/softw are/jelly fish/) combined with GenomeScope (http://qb.cshl.edu/genom escop e/) (Pearson & Lipman, 1988;Vurture et al., 2017) with a kmer size of 21 and run on the three paired-end libraries later used for contig assembly. The heterozygosity value is derived from the smaller peak at half of the expected coverage (75% for the Illumina reads) ( Figure S3).
After read quality control (Andrews, 2010), reads were assembled using SOAPdenovo (Luo et al., 2012) (version 2.04.240). While paired-end reads were used for assembly and scaffolding, mate-pair reads were used solely for scaffolding. We tried various kmer sizes.
The assembly based on kmer 47 outperformed the other assemblies in the number of scaffolds and N50 metric and was retained. Gaps within this assembly were closed using GapCloser [8] reducing the proportion of Ns from 8.8% to 1.4%. The short-read base assembly was further scaffolded using 37 P4C2-chemistry PacBio smrt cells and PbJelly (English et al., 2012) (version 14.4) with default options.
A second round of gap closing was run reducing the proportion of Ns to 0.79%.

| Quality assessment
Quality and completeness of the assembly was assessed by computing assembly metrics, such as N50, NG50, and longest contig, by mapping back the raw paired-end Illumina reads with bowtie2 (Langmead & Salzberg, 2012) (version 2.2.4) and by searching for universal single-copy orthologs using BUSCO (Simao et al., 2015) (version 2) with the avian BUSCO set and default parameters. The assembly metrics and BUSCO results were compared with the assemblies of the American barn owl, the zebra finch, the flycatcher, and the chicken.

| Transcriptome assembly
The transcriptome was assembled using the six libraries prepared with the Kapa stranded mRNAseq Library Preparation kit (Roche) to minimize the sequencing bias for high GC content regions. Libraries were sequenced using the Illumina HiSeq platform at the GTF resulting in total 192 million reads of size 2 × 100 bp summing up to 34 Gbp per tissue (Table S4). The reads were assembled using Trinity (Haas et al., 2013)  was filtered for transcripts with homologies to known proteins, by first extracting long open reading frames (ORFs) followed by a blast and a pfam search to known proteins (Haas et al., 2013).

| Repeat annotation
The evaluation of repeats and low complexity regions were assessed using RepeatModeler version 1.11.0 (Smit & Hubley, 2008 and RepeatMasker version 4.0.7 (Smit et al. 2013(Smit et al. -2015 with default parameters. RepeatMasker was run with RMBlastn version 2.6.0+ for the genome of the American and European barn owl, the zebra finch, the flycatcher, and the chicken individually.

| Proteome comparison
The predicted proteins of the chicken Augustus assembly and the NCBI predicted proteins of the American barn owl genome were compared in a global-global fashion against the chicken reference proteome from Ensembl (ensembl release 88, 30,252 supported protein sequences) using ggsearch (Pearson & Lipman, 1988) (version 36.3.5e, options: -b 1 -d 0 -E 1e-5 -k 1 -m 8).
We also assessed the completeness of the proteome and compared it to the other proteomes by searching for universal single-copy orthologs using BUSCO (Simao et al., 2015) (version 2) with the metazoa BUSCO set and default parameters.

| Comparison of the European and the American barn owl genome
To detect genome structural variants between American and European barn owls, we did not compare the two assemblies directly to each other because their quality differs, and consequently it would be difficult to distinguish biologically relevant differences from technical artifacts. Instead, we mapped the raw Illumina reads used for both assemblies to the European barn owl assembly using bowtie 2 (Newman et al., 2005). A total of 102,000,000 reads (96.2%) of the American barn owl were mapped on the European barn owl assembly and 206,000,000 reads (95.9%) of the European barn owl were mapped on the European barn owl assembly. Since the coverage was highly variable, we computed the ratio for bins of 1kb and potential interesting sites were searched visually. By comparing the two coverages to each other, it was possible to find real biological differences between the two sequenced barn owls. This was true if the Illumina libraries were prepared and sequenced in a similar way. We expected that the read coverage ratio of European barn owl reads over the sum of American and European raw reads would be equal to 0.5 when the coverages are corrected for library size. Deviating ratios may be indicators of duplications and deletions.

| Comparison of the European barn owl and the zebra finch
To assess the chromosomal structure of the European barn owl assembly we aligned the scaffolds to the zebra finch genome using blast, option MegaBLAST and word size 48. Hits were subsequently filtered for a minimal size of 1,000 bp. The zebra finch genome was one of the closest related species for which the genome was assembled at the chromosomal level. Although this species counts 40 pairs of chromosomes and one germline restricted chromosome (Biederman et al., 2018;Torgasheva et al., 2019), only 37 chromosomes and scaffolds were assembled in Ensembl zebra finch karyotype ( Figure S1).

| Avian lost genes
The same mammalian and lizard lost proteins described by Warren et al. (Tables S3 and S4 in Warren) (Warren et al., 2017) namely the 129 lost proteins in chicken but found in other birds (Table S3a) and the 232 proteins found in no avian genome (Table S3b) were searched in Ensembl BioMart for their human Uniprot Swiss-Prot entries (one of the 129 lost proteins set was not retrieved with BioMArt) and aligned them (tblasn 2.7.1+, evalue < 1e-10) against the barn owl transcriptome (Tyto_Alba_DEE_transcriptome.fasta).
Then, the recovered transcripts were reciprocally aligned (blastx 2.7.1+, with E-value <10 -9 , only the entry with the best alignment score was selected) against Swiss-Prot to ensure that the right protein was found and controlled manually for Swiss-Prot entries with the correct name or correct gene description, if the first hit was of another species.

| Phylogenetic tree inference
We compiled five datasets to assess the impact of a new annotation on the positioning of Tyto alba in the bird tree of life.
As a starting point for all datasets, we selected 9 species from the May 2016 OMA database (Altenhoff et al., 2018) and further 7 species including the American NCBI annotation of the American barn owl were retrieved from http://avian.genom ics.cn/en/jsp/ datab ase.shtml (protein sequences from the GigaDB annotations, Table S5) and our annotations of the European barn owl (Tyto alba alba) set of predicted protein sequences. These included all available saurian species at that time available in the OMA database.
Moreover, this included 2 mammals and Xenopus tropicalis as outgroup. In total this dataset included 17 species. We inferred the orthologous relationships between the species using OMA standalone (Altenhoff et al., 2019). Phylogenetic marker genes were then selected using a threshold of minimum 16 species included in an orthologous group (OG). With this threshold, we obtained 2,578 OGs.
From this, we constructed three datasets. The first two datasets (i, ii) contained either American or European barn owl proteomes.
The third dataset (iii) contained both proteomes.
As for datasets (iv) and (v), we added three more species of owls to dataset (i): Strix occidentalis caurina, Bubo blakistoni, and Athene cuniculata. Details are available in Table S5. OMA standalone was run again to predict the orthology. Phylogenetic marker genes were then selected using a threshold of minimum of 19 species, resulting in 1,053 orthologous groups (OGs).
Assessment of incomplete lineage sorting as source for topological variation was performed with ASTRAL v5.6.3  and PhyParts v0.0.1 (Smith et al., 2015) on the small dataset.

ACK N OWLED G M ENTS
The computations were performed at the Vital-IT (http://www.