Natural selection beyond genes: Identification and analyses of evolutionarily conserved elements in the genome of the collared flycatcher (Ficedula albicollis)

It is becoming increasingly clear that a significant proportion of the functional sequence within eukaryotic genomes is noncoding. However, since the identification of conserved elements (CEs) has been restricted to a limited number of model organisms, the dynamics and evolutionary character of the genomic landscape of conserved, and hence likely functional, sequence is poorly understood in most species. Moreover, identification and analysis of the full suite of functional sequence are particularly important for the understanding of the genetic basis of trait loci identified in genome scans or quantitative trait locus mapping efforts. We report that ~6.6% of the collared flycatcher genome (74.0 Mb) is spanned by ~1.28 million CEs, a higher proportion of the genome but a lower total amount of conserved sequence than has been reported in mammals. We identified >200,000 CEs specific to either the archosaur, avian, neoavian or passeridan lineages, constituting candidates for lineage‐specific adaptations. Importantly, no less than ~71% of CE sites were nonexonic (52.6 Mb), and conserved nonexonic sequence density was negatively correlated with functional exonic density at local genomic scales. Additionally, nucleotide diversity was strongly reduced at nonexonic conserved sites (0.00153) relative to intergenic nonconserved sites (0.00427). By integrating deep transcriptome sequencing and additional genome annotation, we identified novel protein‐coding genes, long noncoding RNA genes and transposon‐derived (exapted) CEs. The approach taken here based on the use of a progressive cactus whole‐genome alignment to identify CEs should be readily applicable to nonmodel organisms in general and help to reveal the rich repertoire of putatively functional noncoding sequence as targets for selection.

mammals (Lindblad-Toh et al., 2005;Miller et al., 2007;Siepel et al., 2005), fish (Braasch et al., 2016;Jones et al., 2012;Lowe et al., 2011) and crucifer plants (Haudry et al., 2013;Williamson et al., 2014). In the best-studied organisms such as human, the detection of conserved elements (CEs) has recently been performed at increasingly fine-scale resolutions Rosenbloom et al., 2015). While protein-coding sequence can be determined using various approaches, comparative analyses are key to identifying putatively functional (yet still unknown) noncoding sequence evolving under selective constraint. As an indication of the potential importance of this latter category of sequences, in humans conserved noncoding elements (CNEs) constitute~3.5% of the genome, relative tõ 1.5% for protein-coding sequence Miller et al., 2007;Siepel et al., 2005). Many CNEs function as cis-regulatory elements and have been shown to be especially associated with genes involved in transcriptional regulation and development (Bejerano et al., 2004;Lowe et al., 2011;Woolfe et al., 2005).
The identification of CNEs has proved of considerable importance in enhancing genome annotation , investigating signatures of adaptive evolution in noncoding sequence (Halligan et al., 2013;Hernandez et al., 2011;Williamson et al., 2014), and for identifying putative trait loci (Marcovitz, Jia, & Bejerano, 2016) and transposable element (TE) exaptation events (Lowe & Haussler, 2012;Mikkelsen et al., 2007;Xie, Kamal, & Lander, 2006), the latter potentially being an underrated source of phenotypic novelty. Incorporating putatively functional noncoding annotation is particularly important in the context of ecological or evolutionary studies of natural populations of nonmodel organisms. In the absence of identified CNEs for essentially all such organisms, nomination of candidate loci from signals in genome scans or QTL mapping efforts is necessarily limited to protein-coding genes, which certainly may be misleading.
Furthermore, determining the density and distribution of genomic targets for selection is required to better understand within-genome heterogeneity in genetic diversity (Ellegren & Galtier, 2016).
Alongside variation in the recombination rate, the density of such sites is expected to be the main determinant of variation in genetic diversity via the diversity-reducing effects of selection on linked sites (either by background selection or selective sweeps ;Cutter & Payseur, 2013). In lieu of functional noncoding sequence data for a species of interest, exonic or coding sequence density alone has often been used to approximate targets for selection (Ellegren & Galtier, 2016). However, for many species this is unlikely to represent a suitable proxy, as coding sequence will underestimate the actual density of sites under selection (cf. above). Moreover, coding sequence may potentially fail to accurately capture the true distribution of sites under selection, as at least in vertebrates many CNEs are located within large "gene deserts" containing no protein-coding sequence (Ovcharenko et al., 2005;Siepel et al., 2005), and a negative correlation between the densities of conserved noncoding sequences and exons has been reported in human (Dermitzakis, Reymond, & Antonarakis, 2005). Thus, it is important to incorporate putatively functional noncoding sequence alongside protein-coding sequence when approximating the density and distribution of genomic targets for selection. While this is currently applied for a limited number of well-studied taxa such as human (Enard, Messer, & Petrov, 2014;Hernandez et al., 2011;Lohmueller et al., 2011; McVicker, Gordon, Davis, & Green, 2009) and Drosophila melanogaster (Comeron, 2014;Elyashiv et al., 2016;Halligan & Keightley, 2006;Sella, Petrov, Przeworski, & Andolfatto, 2009), the necessary annotation of noncoding sequence is typically unavailable for studies of nonmodel organisms.
The majority of multispecies whole-genome alignments currently available are referenced to a defined species, including the frequently used MULTIZ alignments (Blanchette et al., 2004) available from the UCSC Genome Browser (Karolchik et al., 2003). The reference species is most often a model organism of the group in question, for example human in the case of mammalian alignments Miller et al., 2007;Siepel et al., 2005) or D. melanogaster for drosophilids (Siepel et al., 2005;Stark et al., 2007). Although it is possible to utilize CEs predicted from an alignment referenced to a species other than the focal species using tools such as UCSC's LIFTOVER, such an approach is likely to underestimate conservation for two related reasons. First, any lineage-specific functional elements that arose on the lineage to the focal species after the branching event between the reference species lineage and focal species lineage will be absent from the alignment and hence any CE prediction. Likewise, any evolutionarily ancient elements lost on the reference species lineage but remaining functional on the focal species lineage will also be absent. The degree of underestimation of conserved sequence is therefore expected to increase in severity as the evolutionary distance between the reference and focal species increases. Although there is often insufficient power to detect many lineage-specific CEs at recent evolutionary timescales (Ponting, 2008;Rands, Meader, Ponting, & Lunter, 2014), even if the number detected is small, such elements are likely to be of specific interest as putative candidates for lineage-specific function or loss of function (Chan et al., 2010;Lowe, Clarke, Baker, Haussler, & Edwards, 2015;Marcovitz et al., 2016;McLean et al., 2011;Schmidt et al., 2010).
Advances in alignment algorithms have led to the production of whole-genome alignments without reference species (e.g., Dubchak, Poliakov, Kislyuk, & Brudno, 2009;Paten, Herrero, Beal, Fitzgerald, & Birney, 2008). A recent example of this approach is PROGRESSIVE CACTUS (Paten et al., 2011), which produces alignments in the hierarchical alignment format (HAL; Hickey, Paten, Earl, Zerbino, & Haussler, 2013). HAL is indexed on all genomes contained within the alignment and can be easily converted to the frequently used multi-alignment format (MAF) with any user-specified reference species. Here, we utilize this flexibility to predict CEs explicitly for a nonmodel organism, by converting an existing 23 sauropsid (birds and reptiles) wholegenome alignment produced by Green et al. (2014) to an alignment referenced to the collared flycatcher (Ficedula albicollis) genome. We explore the density and distribution of CEs and utilize population resequencing data to investigate patterns of genetic diversity at different classes of conserved sites. Using these approaches, we find that the majority of conserved sites in the genome are nonexonic, CRAIG ET AL. | 477 that these conserved nonexonic sites have substantially reduced genetic diversity, and that CNE density is negatively correlated with functional exonic density at local genomic scales. We further incorporate deep transcriptome sequencing and TE annotation to improve genome annotation and identify thousands of CEs putatively derived from TE exaptation events.
Further formatting was performed using MAFTOOLS , with the alignment split into individual files for the 33 assembled collared flycatcher chromosomes (Kawakami et al., 2014), and an additional file containing unassembled scaffolds (with any scaffolds containing fewer than 2,000 alignable sites excluded, where successful alignment was defined at any individual site as alignment between a collared flycatcher base and bases from at least two other species). An in-house Perl script that incorporates the EMBOSS tool CONSAMBIG (Rice, Longden, & Bleasby, 2000) was used to create consensus sequences where there were two or more sequences from the same species present in a MAF block.

| Prediction of conserved elements
PHASTCONS (Siepel et al., 2005) was used to predict CEs from the whole-genome alignment. The phast tool PHYLOFIT (REV model) was used to generate a neutral evolutionary model from fourfold degenerate (4D) sites extracted from the alignment, downloaded as the file WholeGenome_4Dsites_filter2.phy from GigaScience (https://doi. org/10.5524/100125). These 114,709 4D sites were previously defined using nonduplicated coding transcripts from the American alligator (Alligator mississippiensis), where codons were only included if they were present in all species and encoded the same amino acid (Green et al., 2014). The topology of the phylogeny was identical to that derived by Green et al. (2014). Although some 4D sites may be subject to evolutionary constraint (Chamary, Parmley, & Hurst, 2006), including in avian species (Kunstner, Nabholz, & Ellegren, 2011), any subsequent reductions in branch length are expected to reduce power to detect CEs, resulting in only somewhat more conservative predictions. PHASTCONS was run using two alternative parameter sets: the "standard" parameters (SP) that are used to produce the "most conserved" tracks in the UCSC genome browser (expected length = 45, target coverage = 0.3, rho = 0.31; Miller et al., 2007), and a "tuned" parameter (TP) set that was specifically tuned for use with the alignment (expected length = 12, target coverage = 0.0225, rho = 0.27).
Tuning criteria are detailed in the Supporting information.
Additional CE prediction was performed using GERP++ (Davydov et al., 2010) on the same alignments and evolutionary model, using default parameters. For both PHASTCONS and GERP++ predicted CEs, any bases falling within assembly gaps were filtered, and any CEs that comprised more that 20% of such bases were excluded entirely.

| Annotation of conserved elements by site class
Sequence coordinates for CDS, 5 0 and 3 0 UTRs, and pseudogenes were taken from the ENSEMBL version 73 collared flycatcher genome annotation (see Uebbing et al. (2016) for detailed description), translated from the FicAlb1.4 to the FicAlb1.5 assembly. RNA gene coordinates for miRNAs, rRNAs, snoRNAs, snRNAs and other miscellaneous RNAs were also taken from the ENSEMBL annotation and combined with additional rRNA, scRNA, snRNA and tRNA genes annotated by an in-house pipeline (see Repeat annotation). For protein-coding genes with annotated 5 0 UTRs of at least 15 bp, promoters were annotated as 2 kb upstream from the transcription start site. Putatively novel coding sequence, UTR sequence and long noncoding RNA (lncRNA) genes were annotated based on expression data (Supporting information). Following Lindblad-Toh et al. (2011), the intersect between CEs and the various annotated classes was defined in a hierarchical format, such that if a base within a CE overlapped two or more different classes, it was assigned to a single class based on the first appearance in the following order: CDS, 5 0 UTR, 3 0 UTR, promoter, RNA gene, novel CDS, novel UTR, lncRNA, intronic, intergenic. This approach affected only 1.1% of CE bases.
Conserved noncoding elements were defined as CEs without any overlap with exon-associated site classes (CDS, 5 0 and 3 0 UTRs, promoters, novel CDS/UTRs, RNA genes and pseudogenes). Similarly defined elements have previously been more accurately termed "conserved nonexonic elements" (e.g., by Lowe et al. 2011), but we here use the more widely used term "conserved noncoding elements." CEs overlapping lncRNAs but no other exonic site classes were defined as CNEs, as despite these being putatively exonic, their function (if any) is most often unknown, and hence, it is unclear whether the observed conservation is due to their transcription or some other function.
2.4 | Distribution and density of conserved noncoding elements CNE density and the density of CDS and any other exon-associated CEs (hereafter referred to as CDS + exonic CEs) were calculated for nonoverlapping 200-kb windows along the genome. Windows were discarded if they included assembly gaps between scaffolds, and window size was therefore selected to maximize the number of complete windows residing on single scaffolds (a size of 200 kb resulted in 5,000 windows covering 89.4% of the genome). Total CE densities were calculated per chromosome, and the percentage of alignable sites per chromosome was noted (where alignment between collared flycatcher and at least two other species at an individual site was considered successful). CE densities were then corrected to reflect the number of alignable bases per chromosome rather than total chromosome length. All correlations were performed on 31 of the 33 assembled chromosomes, with the two smallest chromosomes (LG34 and LG35; both <500 kb) excluded as <2% of their bases were alignable.
Gene deserts were defined as intergenic regions >200 kb and were further categorized as evolutionarily stable or variable based on collared flycatcher-human alignability (stable deserts >5% sites aligned, variable <5% sites aligned, Supporting information).

| Comparison of conservation and polymorphism
We used the protocol for estimating nucleotide diversity described in Dutoit, Burri, Nater, Mugal, and Ellegren (2016), based on wholegenome resequencing data from 20 (10 males, 10 females) collared flycatchers from a single Italian population from the Apennine Mountains (available on the European Nucleotide Archive (ENA), Accession no. PRJEB7359). Information regarding the sampling, DNA analyses and bioinformatics methods used to produce these data can be found in Burri et al. (2015), and information regarding additional variant filtering and estimation of nucleotide diversity is given in Dutoit et al. (2016). Briefly, nucleotide diversity (p; Nei & Li, 1979), the average number of pairwise differences per site among chromosomes in a population, was estimated collectively for the 20 individuals (40 chromosomes) using in-house Python scripts and the Python package PYVCF 0.4.0 (https://pyvcf.readthedocs.org). This was done for all sites within each of the annotated site-class categories described above, for the intersect between CEs and each site-class category, and for the remaining sites not overlapped by CEs. Only autosomal sequences were analysed.

| Update of genome annotation using transcriptome sequencing
To investigate if any conserved bases overlapped previously unannotated exonic sequences (CDS and UTRs) and to investigate patterns of sequence conservation in lncRNA genes, we updated the collared flycatcher genome annotation using deep transcriptome data. Briefly, putatively novel protein-coding sequences were identified using BRAK-ER1 (Hoff, Lange, Lomsadze, Borodovsky, & Stanke, 2016) with STAR alignments (Dobin et al., 2013), and novel UTR sequences were identified using MAKER (Holt & Yandell, 2011) with a de novo transcriptome assembly produced using TRINITY . A reference-based transcriptome assembly produced with CUFFLINKS (Roberts, Trapnell, Donaghey, Rinn, & Pachter, 2011;Trapnell et al., 2010) from the STAR alignments was used in conjunction with a pipeline of filtering steps to annotate lncRNA genes. Detailed methods are available in the Supporting information.

| Repeat annotation
To identify CEs putatively originating from the exaptation of transposable elements, we utilized a recent in-depth annotation of collared flycatcher repeats (Suh, Smeds, & Ellegren, 2017). Briefly, repetitive elements and additional RNA genes were identified in the FicAlb1.5 genome assembly using REPEATMASKER v4.0.1 (Smit, Hubley, & Green, 1996  TEs from a single superfamily.

| Identification of lineage-specific conserved elements
For each collared flycatcher CE, presence/absence of orthologous sequence in the whole-genome alignment was scored for each species to determine lineage-specific CEs. Following Lowe et al. (2015), absence of a CE was defined in a species if less than one-third of orthologous bases were aligned. CEs were then defined as specific to a lineage if they were present in a species from the deepest branch of that lineage, but absent in all more distantly related species. As this approach could potentially lead to false positives due to gaps in genome assemblies and/or alignment errors, CEs were only defined as lineage-specific if there were at least three outgroup species (all with an absence of orthologous CE sequence) available at a particular branch for comparison. This resulted in four scales of lineage-specificity: archosaur-specific CEs (absent from anole lizard Anolis carolinensis and all four turtles, but present in at least one crocodilian), avian-specific CEs (absent from all reptiles, including the three crocodilian species, but present in ostrich Struthio camelus (representing the deepest branching avian clade)), neoavian-specific CEs (absent from all reptiles, ostrich and the three fowl species, but present in rock pigeon Columba livia) and passeridan-specific CEs CRAIG ET AL.

| 479
(absent from all reptiles and all nonpasseridan birds (with the three parrot species as outgroups), but present in ground tit Pseudopodoces humilis). CEs were defined as dating from before the origin of all extant sauropsids (~280 mya, Alfoldi et al., 2011) if they were present in the anole lizard.
It should be noted that if the whole-genome alignment does not contain a representative species from the deepest lineage for one of the clades of interest (potentially for Neoaves, Suh, 2016;and almost certainly for Passerida, Moyle et al., 2016), then CEs specific to these clades are only defined herein in relation to the species included in this study. Also, for most lineages specificity is only defined based on presence in a single species, which could underestimate CE lineage-specificity (again due to assembly gaps/alignment errors). Finally, a small proportion of CEs assigned to a particular evolutionary branch may have a more ancient origin than inferred due to CE loss events. For example, a CE that originated prior to the split between crocodilians and birds that was then subsequently lost in the crocodilian lineage (and is hence absent from all extant crocodilians) would correctly be defined as avian specific, but its origin would be incorrectly assigned to the evolutionary branch between crocodilians and birds.
To investigate potential phenotypic consequences of CE innovation, we assigned genes with the nearest upstream or downstream transcription start site to each neoavian-and passeridan-specific CNE, as enhancers can influence gene expression over long distances in either orientation (Visel, Rubin, & Pennacchio, 2009). CEs specific to these lineages that had direct overlap with genes were assigned to those genes. GO term analysis for the associated genes was then performed as described for genes flanking gene deserts (Supporting information).

| 23 Sauropsid whole-genome alignment
The effectiveness of a whole-genome alignment for identifying CEs is a function of the total branch length of the phylogeny representing the aligned species, as well as the alignability of the genomes included (Cooper et al., 2003). While the power to detect CEs can be increased by aligning as many distantly related species as possible (maximizing total branch length), such increases in phylogenetic scope result in a diminishing proportion of the reference genome that can be aligned to all other species (resulting in only the most ancient CEs being identified with the full power available ;Cooper et al., 2003;Miller et al., 2007). The quality of the included genomes is also important, as lower assembly qualities will further reduce alignability (Miller et al., 2007). Hence, an optimal alignment will include high-quality genome assemblies for species that maximize branch length at a reasonable phylogenetic scope.
The 23 sauropsid whole-genome alignment produced by Green et al. (2014) is currently the most species-rich reference-free alignment that includes the collared flycatcher genome. As such, it was the optimal alignment for our purpose of identifying CEs in the genome of a nonmodel organism, without the necessity of producing a new alignment. The phylogeny connecting the 23 sauropsids (Table S1; 15 birds, three crocodilians, four turtles and anole lizard) provided a total branch length of 2.63 substitutions per 4D site (subs/site), and the subphylogeny connecting the 15 avian species 1.24 subs/site ( Figure 1a). Notably, four of the five high-quality assemblies were avian species (collared flycatcher, zebra finch Taeniopygia guttata, chicken Gallus gallus and turkey Meleagris gallopavo), providing higher quality genomes at the more recent phylogenetic scope of the alignment. Considering alignability, 88.6% of sites were alignable between collared flycatcher (the focal species) and at least two other species (Figure 1b), and at such sites, the average number of aligned species was 15.5, with an average total branch length of 1.43 subs/site. Comparatively, 96.0% of collared flycatcher coding sequence (CDS) was alignable, with the average number of aligned species increasing to 21.5, and average total branch length to 2.43 subs/site, at such sites.

| Conserved elements
The PHASTCONS analyses identified~1.41 and~1.28 million CEs, spanning 8.2% (91.3 Mb) or 6.6% (74.0 Mb) of the collared flycatcher genome, for the SP ("standard") and TP ("tuned") methods, respectively (Table S2). The genomic regions corresponding to SP-predicted CEs had~7 times as many bases overlapping both assembly gaps and flycatcher lineage-specific TEs relative to the TP-predicted CEs, and some of the longest SP-predicted CEs harboured long assembly gaps (e.g., the longest contained two gaps of 590 and 312 bp), indicative of either "coverage" being too high or over-"smoothing" in the PHASTCONS model (Siepel et al., 2005). The TP-predicted CEs were almost entirely overlapped by the SP data set (<0.1% of TP-predicted CEs had no overlap), suggesting that they could be treated as a higher confidence data set contained within the larger SP data set.
The TP data set presumably had a lower rate of false positives, although conceivably a higher rate of false negatives; we thus note that the SP data set can be treated as an upper bound for PHAST-CONS-predicted conservation in the collared flycatcher. We therefore focused all further analyses on the TP data set of CEs. GERP++ analysis identified 817,511 CEs spanning 12.3% (137.3 Mb) of the genome. As is the case in human, PHASTCONS-predicted CEs generally fell within the longer GERP++ CEs (Davydov et al., 2010), with 88.0% and 94.5% of PHASTCONS CEs (SP and TP methods, respectively) overlapped by GERP++ CEs. The higher congruence between the GERP++ CEs and PHASTCONS TP-predicted CEs (relative to SP-predicted CEs) also supported the decision to focus analyses only on the TP data set.

| Annotation of conserved elements by site class
Incorporating annotation from CDS, 5 0 and 3 0 UTRs, promoters, RNA genes and novel putative CDS and UTRs, 31.7% of CEs (23.5 Mb) were found to be associated with exonic sequence. The majority of putatively functional sequence in the collared flycatcher genome is thus neither protein-coding nor exonic. At the nucleotide level, 23.8% of CE bases overlapped CDS, 0.5% 5 0 UTRs, 3.1% 3 0 UTRs, 0.6% promoters, 0.1% RNA genes and 0.9% previously unannotated CDS and UTR sequence (Figure 2). With the exception of promoters, all of these site-class categories overlapped more CE bases than expected given a random distribution of conserved bases and exhibited the highest proportions of bases overlapped by CEs (Table 1).
Unsurprisingly, CDS showed the most constraint among site-class categories, with 70.4% of bases overlapped by CEs, a 9.2-fold increase relative to random expectation. This corresponded to 89.6% of CDS exons having at least a partial overlap with CEs.
Despite the extensive conservation detected in exonic sequence, most CE bases were found within intronic and intergenic sequence (23.7% and 44.4%, respectively; Figure 2). Putative lncRNA genes also overlapped a substantial proportion of CE bases (2.9%), but fewer than expected by chance, consistent with high levels of evolutionary turnover/lineage-specificity and/or limited selective constraint (Supporting information). The majority of the conserved bases overlapping lncRNA genes, introns and intergenic sequence corresponded to over 900,000 CNEs covering 50.5 Mb of the genome, with these elements on average 14.8 bp shorter than exon-associated CEs (mean length 53.8 bp, relative to 68.6 bp).

| Distribution and density of conserved noncoding elements
Considering all CDS and other exonic CEs collectively as a proxy for functional exonic sites, total density of such sequences was 2.6% of the collared flycatcher genome (28.9 Mb), relative to 4.5% (50.5 Mb)  F I G U R E 1 Phylogeny and alignability of the species in the 23 sauropsid wholegenome alignment (Green et al., 2014). (a) Sauropsid phylogeny for all 23 species based on substitution rates at fourfold degenerate sites extracted from the wholegenome alignment. Branch labels represent substitutions per site, with branches with >0.05 substitutions coloured red and labelled (these branches account for the majority of the total branch length, and hence power to detect CEs). Species names of high-quality genomes are coloured green, medium-quality blue and low-quality black (see Table S1). (b) The number of alignable bases for each species, where successful alignment was defined at a site as alignment between collared flycatcher and at least two other species. Colour codes indicate taxonomic order for total CNE density, reinforcing that most conserved sequences are embedded within noncoding regions. CDS + exonic CE density and CNE density were negatively correlated (200 kb windows, Pearson's correlation, R = À.284), showing that CDS + exonic CE density is not an appropriate proxy for the overall density of targets for selection ( Figure 3a). Additionally, the correlation between CNE density and CDS density (the most common proxy for genomic targets for selection) was slightly more negative (R = À.292) in the same genomic windows. These results were largely independent of window size, with negative correlations observed for sizes at least up to 2.5 Mb. CNE density in 200 kb windows ranged from 0% to 28.6% of bases (median 3.6%) and CDS + exonic CE density from 0% to 29.1% (median 1.8%). There were 760 windows (15.2%) with 0% CDS + exonic CE density, 710 of which had a CNE density >1% (mean CNE density 6.0%).
As the above negative correlations could be artefacts caused by windows with high exonic density having lower CNE density due only to having less nonexonic sequence, CNE densities were recalculated based on the proportion of nonexonic sequence in each window (rather than the full 200 kb). Both correlations remained negative after this correction, although less so (corrected CNE and CDS + exonic CEs, R = À.222; corrected CNE and CDS, R = À.233).
Examining the largest intergenic regions of the genome, 747 intervals longer than 200 kb contained no protein-coding genes and were considered gene deserts. These regions comprised 51.2% As in human, CNE density in variable deserts was considerably lower (3.6%) than that observed in stable gene deserts (8.6%). CNEs located within variable gene deserts were more often lineage-specific, with 14.5% identifiable as avian-specific and 38.9% identifiable as present in the anole lizard, relative to 9.1% and 47.4% for stable gene desert CNEs. The latter result suggests that although CNE turnover may be reduced in stable gene deserts, it is still considerably higher than that observed for exon-associated CEs (90.6% of which are present in the anole lizard). Alternatively, genuinely orthologous CNEs may exist in the anole lizard, but these may be difficult to align in such large intergenic regions. Genes immediately flanking stable gene deserts were enriched for GO terms relating to transcriptional regulation and DNA binding (Table S3) and included the DACH1 and SOX2 genes that also flank human stable gene deserts (Ovcharenko et  Fold enrichment was calculated based on the increase or decrease in overlap observed relative to the expectation under a random distribution of CE bases across the genome.

2005), while genes flanking variable gene deserts were instead
enriched for a wider range of ontology terms (Table S3). between size and density was only weakly negative (R = À.144). This suggests that the higher gene density on microchromosomes is largely driving the correlation between chromosome size and CE density, and potentially the observed reductions in genetic diversity relative to macrochromosomes. Despite the differences in exonic densities, there may not be substantial differences in CNE densities between macro-and microchromosomes, although better assemblies and alignments of microchromosomes may be required to investigate any potential differences in CNE densities further.
3.5 | Patterns of genetic diversity at conserved sites Nucleotide diversity was reduced for all CEs (p = 0.00142) by 66.7% relative to nonconserved intergenic sequence (used as a putative neutral reference, p = 0.00427). Across all site-class categories (with the exception of pseudogenes, presumably due to nonfunctional collared flycatcher CEs being identified via alignment to still functional orthologs in other species) nucleotide diversity was reduced for bases overlapped by CEs, relative to neutral sequence as well as to sites within the same site-class not overlapping CEs ( Figure 5). Comparing sites within the same site class (and ignoring pseudogenes), the reduction between CE and non-CE sites ranged from 58.2% for CDS to 72.7% for 3 0 UTRs. Conserved exonic sequence generally exhibited the lowest nucleotide diversity, with 3 0 UTRs the least variable site class (p = 0.000904), followed by novel UTRs (which include 3 0 and 5 0 UTRs, p = 0.00100), RNA genes (p = 0.00103), CDS (p = 0.00113, which is presumably higher than most other exonic categories due to CEs containing both degenerate and nondegenerate sites), 5 0 UTRs (p = 0.00116) and novel CDS (p = 0.00164).

| Exaptation of transposable elements
Based on at least 50% of bases being overlapped by a TE (or TEs from the same superfamily), 10,287 CEs (~626 kb) were identified as putatively exapted from 7,298 TEs (some TEs harbour more than one CE). Of these CEs, the vast majority (9,980,~607 kb) were CNEs derived from 7,085 TEs. This corresponds to 0.8% of CEs, and 1.1% of CNEs, putatively originating from TE exaptation events.
Lineage-specific CEs may more accurately capture the proportion of TE-derived CEs (Lowe & Haussler, 2012), and 1.4% of avian-specific CEs (1.5% of CNEs) were identified as putatively exapted (Supporting information). While the exaptation of TEs as cis-regulatory elements can be summarized by solely considering CNEs, TEs have also been shown to contribute functional sequence within exonic regions (e.g., in UTR sequence and RNA genes, Chuong, Elde, & Feschotte, 2017;Mikkelsen et al., 2007, as well as in protein-coding sequence via exonization, . We therefore considered the full set of CEs for our analyses, although the results are largely identical when considering all CEs (Table 2) or only CNEs (Table S4).
The most abundant superfamily of TEs in the collared flycatcher (and most birds) are CR1 LINEs (Kapusta & Suh, 2016), and unsurprisingly, these elements have been involved in more than twice as many exaptation events as other TE superfamilies ( | 485 these families have contributed to the origin of functional sequence, such exaptation events are likely to be absent from the CE data due to a lack of power to detect conservation at this recent evolutionary timescale. In contrast, the superfamilies with the highest relative numbers of exaptations were ancient (Supporting information); for example, most SINE superfamilies exhibited high proportions of exapted TEs (45.1% -67.5%, Table 2). These superfamilies contain the CORE-SINE families, copies of which are present in all amniote genomes and show no evidence of activity since the last common ancestor of birds (Kapusta & Suh, 2016).

| Lineage-specific conserved elements
Based on presence/absence of aligned bases, 679,184 CEs (53.0% of all CEs) were inferred as having an evolutionary origin before the emergence of the sauropsid clade~280 mya (Alfoldi et al., 2011), while 219,311 (17.1% of CEs) were defined as having an origin after the split between Testudines (turtles) and Archosauria (crocodilians and birds)~257 mya (Wang et al., 2013; with the remaining CEs identifiable in at least one turtle genome, but not in the anole lizard genome). Of these lineage-specific CEs, 208,505 (95.1%) could be assigned to one of the four evolutionary branches considered (archosaur-, avian-, neoavian-and passeridan-specific) based on CE absence in at least three outgroup species (Table 3). Considering the CEs assigned to one of these four lineages, over 95% were CNEs, compared to 68.3% for all CEs and 56.9% for CEs present in the anole lizard genome. This result is broadly consistent with a higher rate of turnover of functional noncoding sequence relative to coding sequence (Mikkelsen et al., 2007;Rands et al., 2014).
We identified 41,820 CEs (~1.8 Mb) that were absent from the chicken genome. These presumably represent CE losses on the chicken lineage, or CE gains on the collared flycatcher lineage, after the split between Neoaves and Galloanserae. We consider these CEs as a conservative estimate of the amount of conserved sequence that would be missed by utilizing existing CE predictions from the chicken genome in a passeridan species (e.g., using a LIFTOVER-like tool). Additionally, we further identified 2,252 neoavian-specific CEs and 247 passeridan-specific CEs. Genes associated with neoavianspecific CEs were enriched for GO terms related to regulatory functions, as well as terms involved specifically in the development of certain tissues/organs (e.g., the pituitary gland and pharynx; Table S6). None of the genes associated with passeridan-specific CEs were significantly enriched for GO terms; however, intriguingly several of the most significant terms prior to FDR correction were related to neural functions (Table S6). These regions may be interesting candidates for further research given the interest in songbird neuroscience (Warren et al., 2010).

| DISCUSSION
This study demonstrates how CEs can be identified and analysed in a nonmodel organism using a combination of comparative genomic, population genomic and transcriptomic approaches. We used PRO-GRESSIVE CACTUS alignments, which can efficiently extend the prediction of CEs to any species with an available genome sequence included in an alignment, in turn enabling the identification of CEs specific to previously unstudied lineages. In the following, we discuss specific aspects related to the identification and analysis of CEs in the collared flycatcher genome.

| Comparison with existing predictions of avian conserved elements
Two previous studies have used species-rich whole-genome alignments to predict CEs in the chicken genome, with Lowe et al. (2015) producing a 19-way vertebrate species alignment and predicting  Miller et al., 2007). It should though be noted that birds exhibit a higher genomic coverage of CEs relative to mammals as a result of their smaller genome size (~6.5%-8.5% in birds vs.~5% in mammals).
The difference in the estimated total amount of conserved sequence between these vertebrate lineages could potentially be explained by smaller avian genomes requiring less regulatory sequence involved with the organization of chromatic structure.
Alternatively, there has been some evidence for substantial gene loss during avian evolution (Lovell et al., 2014;Meredith, Zhang, Gilbert, Jarvis, & Springer, 2014;Zhang et al., 2014), which would be expected to reduce the amount of conserved sequence as genic sequences comprise a large proportion of CEs (and intergenic CNEs specifically associated with a lost gene would in turn also be lost).
However, the idea that birds would contain fewer genes than mammals has been questioned (Hron, Pajer, Paces, Bartunek, & Elleder, 2015) and may be an incorrect conclusion resulting from an underrepresentation of GC-rich genes in short-read avian assemblies (as there are inherent difficulties in sequencing and assembling GC-rich regions using Illumina sequencing, Bentley et al., 2008;Chaisson, Wilson, & Eichler, 2015). Moreover, while studies of mammalian CEs have been able to utilize the most complete vertebrate genome assemblies in human and mouse, as well as high-quality draft genomes for a number of other species (e.g., chimpanzee, rat, and dog), avian genome sequencing has a substantial challenge of assembling the GC-rich microchromosomes (numerous chromosomes smaller than 5-10 Mb). These are gene dense and so are expected to contribute disproportionately to the total density of functional sequence. All current chromosome-level avian assemblies lack several microchromosomes (including the most recent chicken genome assembly, galGal5, Warren et al., 2017), and those microchromosomes that are assembled in the collared flycatcher exhibited reduced alignability (including in coding sequence) in this study.
While we also scored unassembled scaffolds for CEs, these too showed considerably reduced alignability. Long-read assemblies have the potential to improve the quality of large and complex genomes (Gordon et al., 2016), and the further application of such approaches to avian species would thus be expected to lead to increases in the amount of assembled, and subsequently alignable, functional sequence. While such improvements are expected to be genomewide, the potential for increasing quality is perhaps greatest for microchromosomes.
It is also notable that promoters were the only exon-associated site class that contained fewer conserved bases than expected under a random distribution of CE coverage. This is at first glance surprising considering the high constraint observed in many human promoter regions , and existing evidence for conservation at transcription factor binding sites and short tandem repeats within avian promoters (Abe & Gemmell, 2016). A possible explanation for this result is that promotor sequences contain a high proportion of assembly gaps (21.8%) in the collared flycatcher assembly, potentially due to their high GC content. This again highlights the potential increases in resolution of CE prediction that could be achieved with improved genome assemblies. In the end, it cannot be excluded that the observation of birds having a lesser total amount of conserved sequence relative to mammals is mainly due to methodological issues.

| Incorporating conserved noncoding elements when approximating genomic targets for selection
We found the proportion of noncoding elements among all CEs to be similar to that observed in mammals (~70%), where even in human the majority of conserved noncoding bases remain functionally unannotated . Very little is known about the function of CNEs in avian genomes, with only one recent study utilizing chromatin immunoprecipitation sequencing (ChIP-seq) to investigate potential regulatory functions for avian-specific CNEs in chicken (Seki et al., 2017). Together with the observed reductions in nucleotide diversity of collared flycatcher CNEs, our data are consistent with CNEs evolving under purifying selection both over historical (conservation) and contemporary (reduced diversity) timescales. CNE density was negatively correlated with functional exonic density at local genomic scales, demonstrating that the latter alone is not an ideal proxy for the overall density of targets for selection. The collared flycatcher has been studied extensively in the field of speciation genomics Ellegren et al., 2012; Nadachowska-Brzyska, Burri, Smeds, & Ellegren, 2016;Nadachowska-Brzyska et al., 2013;, where it is becoming increasingly apparent that explicit modelling of the heterogeneity in genetic diversity is required in the interpretation of patterns of genetic differentiation between species, especially concerning the appearance of "divergence islands" (Cruickshank & Hahn, 2014;Wolf & Ellegren, 2017 to investigate the generality of these contrasting results from birds and mammals. Approximately a quarter of TE-derived CEs were present in the anole lizard genome and hence presumably predate the origin of the sauropsid clade~280 mya (Alfoldi et al., 2011). This demonstrates that many TE-derived CEs have an ancient origin and have been conserved for a very long time, and is consistent with the finding that orthologous sequences from exapted TEs in the human genome can be identified in chicken (implying an origin before the common ancestor of all amniotes; Gentles et al., 2007) and that a small number have even been found to predate the origin of ray-finned fish (Lowe & Haussler, 2012). The majority of TE-derived CEs were, however, absent from the anole lizard genome, and over 3,000 were inferred to be specific to the archosaur and avian lineages. Taken together, these findings reinforce the view that the exaptation of TEs is an evolutionary ancient phenomenon, which has continued at least throughout the early evolutionary history of birds, albeit with reduced quantity to that observed in mammals.

| Identification of lineage-specific conserved elements
We identified over 200,000 CEs specific to one of four evolutionary branches considered, namely archosaur-, avian-, neoavian-and passeridan-specific CEs. Relative to CEs inferred to be present in the sauropsid common ancestor, a far higher proportion of these lineage-specific CEs were noncoding. Functional noncoding sequence is thought to have reduced pleiotropy relative to coding sequence (Carroll, 2008), and it follows that the loss of functional noncoding sequence should be less deleterious than that of coding sequence and is expected to be more common (Carroll, 2005). It has thus been hypothesized that the proportion of lineage-specific functional noncoding sequence is substantially higher than that of protein-coding sequence, with evidence for extensive turnover of functional noncoding sequence in mammalian evolution (Mikkelsen et al., 2007;Rands et al., 2014;Smith, Brandstrom, & Ellegren, 2004), and for loss events of hundreds of CNEs in specific mammalian lineages (Hiller, Schaar, & Bejerano, 2012). Presumably, many of the putative CE gains we have identified have not substantially increased the overall amount of conserved sequence due to loss events for other CEs; however, exploring the specific dynamics of such turnover would require a more thorough assessment of CE gains and losses than has been performed in this study.
We also identified~2,500 CEs specific to the neoavian and passeridan lineages. These CEs are expected to be only a subset of the total number of functional elements specific to these lineages, as only the longest and/or most conserved elements will have been detected based on the available power in these relatively short branches. For example, the proportions of lineage-specific CEs at these scales that are noncoding (61.9% of neoavian-specific elements are CNEs, 49.8% for passeridan specific) is considerably lower than that observed for archosaur and avian-specific CEs (94.7% and 96.1%, respectively). While this may reflect true biology, it seems more probable that this is a result of a bias towards identifying exonic CEs when the power to detect conservation is low. Generally, identifying functional sequence that has arisen at recent evolutionary timescales has been a major challenge for methods that rely on conservation between multiple species, due to a lack of power provided by the short branch lengths that connect the aligned species at these timescales (Ponting, 2008;Rands et al., 2014). However, if the genomes of many species that are closely related to the species of interest can be aligned, reasonable branch lengths for the prediction of CEs should also be achievable at relatively recent evolutionary timescales . For two reasons, passerines (and songbirds in particular) may represent a particularly attractive group to further address this issue in the future. First, the passerine lineage (order Passeriformes) diversified~33-39 mya and contains~60% of all extant avian species Moyle et al., 2016); songbirds (oscine passerines) constitute a single lineage within the passerines and alone comprise the vast majority of these species, representing the largest radiation of birds (Barker, Cibois, Schikler, Feinstein, & Cracraft, 2004). With such extensive species richness, the songbirds represent an exceptional opportunity to achieve a relatively large total branch length at a recent evolutionary timescale.
Second, passerine birds are studied extensively to address numerous questions in evolutionary biology and ecology, and gaining further insight into lineage-specific functional innovations should be expected to aid and develop many of these research areas.
The ability to predict lineage-specific elements in this study was derived from the flexibility of the PROGRESSIVE CACTUS 23 sauropsid whole-genome alignment produced previously by Green et al. (2014). Approximately 1.8 Mb of conserved sequence was identified in the collared flycatcher genome that is not present in the chicken genome, revealing regions putatively involved in functional innovations on the collared flycatcher lineage, or functional losses on the chicken lineage. Such loss events for CEs have recently enabled regulatory sequence to be mapped to putative phenotypes in mammals (Marcovitz et al., 2016), exemplifying the potential uses of identifying such candidate regions. The whole-genome alignment used in this study could be adapted to predict CEs explicitly in any of the other 22 included species, and lineage-specific CEs could be identified in any of the clades containing several species, including the Psittacidae (parrots), Crocodilia and Testudines; three lineages that to our knowledge have not been scored for CEs previously. Such flexibility is therefore of great potential use in exploring evolutionary innovations in previously understudied groups, and we expect that PROGRESSIVE CACTUS alignments will be frequently used in the future to extend the prediction of CEs to a wide range of nonmodel species, especially in well-studied groups such as mammals and birds where increasingly species-rich alignments are anticipated.

| CONCLUSION S
PROGRESSIVE CACTUS whole-genome alignments can be used to efficiently identify conserved elements in any organism included within the alignment. Such an approach avoids biases introduced by reference-based alignments and enables the identification of elements specific to previously unstudied lineages. Our results, based on sequence conservation across species, as well as reduced diversity within a population, reinforce the finding that the use of coding or exonic sequences alone will fail to accurately capture the density and distribution of sites evolving under purifying selection within eukaryotic genomes. With >70% of conserved sites being nonexonic, these data will help to reveal the full genomic landscape of putatively functional sequence including identification of elements involved with lineage-specific adaptations.