Abundant recent activity of retrovirus‐like retrotransposons within and among flycatcher species implies a rich source of structural variation in songbird genomes

Transposable elements (TEs) are genomic parasites capable of inserting virtually anywhere in the host genome, with manifold consequences for gene expression, DNA methylation and genomic stability. Notably, they can contribute to phenotypic variation and hence be associated with, for example, local adaptation and speciation. However, some organisms such as birds have been widely noted for the low densities of TEs in their genomes and this has been attributed to a potential dearth in transposition during their evolution. Here, we show that avian evolution witnessed diverse and abundant transposition on very recent timescales. First, we made an in‐depth repeat annotation of the collared flycatcher genome, including identification of 23 new, retrovirus‐like LTR retrotransposon families. Then, using whole‐genome resequencing data from 200 Ficedula flycatchers, we detected 11,888 polymorphic TE insertions (TE presence/absence variations, TEVs) that segregated within and among species. The density of TEVs was one every 1.5–2.5 Mb per individual, with heterozygosities of 0.12–0.16. The majority of TEVs belonged to some 10 different LTR families, most of which are specific to the flycatcher lineage. TEVs were validated by tracing the segregation of hundreds of TEVs across a three‐generation pedigree of collared flycatchers and also by their utility as markers recapitulating the phylogenetic relationships among flycatcher species. Our results suggest frequent germline invasions of songbird genomes by novel retroviruses as a rich source of structural variation, which may have had underappreciated phenotypic consequences for the diversification of this species‐rich group of birds.

Understanding past and ongoing TE activity is thus an important aspect of elucidating the molecular basis of phenotypic changes.
Birds have the smallest genomes among amniotes (Gregory, 2017), ranging from an estimated 0.9 Gb in the black-chinned hummingbird to 2.1 Gb in the ostrich (Gregory, Andrews, McGuire, & Witt, 2009;Wright, Gregory, & Witt, 2014) among some 758 species for which genome size has been estimated (Gregory, 2017). Their average genome size is 1.33 Gb and thus approximately a third of that of mammals (Hillier et al., 2004;. Importantly, the diminutiveness of avian genomes coincides with significantly lower densities of transposable elements (TEs) than in other land vertebrates (~10% in birds vs.~30% nonavian reptiles and~50% in mammals; Chalopin, Naville, Plard, Galiana, & Volff, 2015;Kordi s, 2009;Sotero-Caio, Platt, Suh, & Ray, 2017). Most bird species exhibit relatively high degree of chromosomal synteny (Ellegren, 2010) and stability of chromosome numbers with many species showing 2n % 80 (Ruiz-Herrera, Farr e, & Robinson, 2012). A common explanation for this stability is the scarcity of TEs in birds under the assumption that chromosomal rearrangements are related to or promoted by TE activity (Janes, Organ, Fujita, Shedlock, & Edwards, 2010;Shedlock, 2006;Shedlock et al., 2007). Moreover, it has often been suggested that the small size of avian genomes (and thereby indirectly low TE densities) is the result of constraints on cell size (and thus genome size) associated with the metabolic requirements of powered flight (Gregory et al., 2009;Hughes & Hughes, 1995;Vinogradov, 1997;Wright et al., 2014;Zhang & Edwards, 2012).
While both phylogenetic analyses of TE presence/absence in different species (Baker, Haddrath, McPherson, & Cloutier, 2014;Kaiser, van Tuinen, & Ellegren, 2007;Suh, Smeds, & Ellegren, 2015;Suh et al., 2011Suh et al., , 2016 and analyses of TE divergence landscapes Zhang et al., 2014) demonstrate that there was TE activity across the breadth of the avian phylogeny, the low densities of TEs in avian genomes imply that overall TE activity was likely lower than in other amniotes (Chalopin et al., 2015;Kordi s, 2009;Sotero-Caio et al., 2017). However, it is not known whether this is also a characteristic of contemporary bird species such that TE insertions would play a minor role in ongoing trait evolution and adaptation of birds. To address this question, there is a need for large-scale analyses of TE presence/absence variation (TEV) in avian genomes to quantify recent TE activity. Previously, only a single TEV has been reported in grebes and chickens, respectively (Lee et al., 2017;Suh, Kriegs, Donnellan, Brosius, & Schmitz, 2012).
The recent availability of massive whole-genome resequencing data paves the way for estimating the amount and character of TEVs segregating within populations. Studies of model organisms including human (Rishishwar, Tellez Villa, & Jordan, 2015;Sudmant et al., 2015), mouse (Nell aker et al., 2012), Drosophila melanogaster (Barr on, Fiston-Lavier, Petrov, & Gonz alez, 2014;Kim et al., 2014), yeast (Jeffares et al., 2015;Liti et al., 2009) and Arabidopsis thaliana (Quadrana et al., 2016;Stuart et al., 2016) have shown that rates and diversity of transpositions vary significantly between and within populations. This is likely of significance to evolutionary processes in population and species differentiation.
The latter is particularly important as it serves as baseline information on the amount of genetic variation present within these species.
We in-depth annotated the TE content of the flycatcher genome and scanned whole-genome resequencing data of 200 genomes from 10 populations of four Ficedula flycatcher species for TEVs. In addition, we screened a three-generation pedigree of collared flycatchers for TEVs and followed the inheritance of segregating variants. This provides, to our knowledge, one of the first large-scale assessments of TEVs segregating in natural populations of nonmodel species.

| Generation of collared flycatcher TE landscapes
We merged the consensus sequences of the finalized repeat curation and classification (Data S1; also deposited in Dfam_consensus, http://dfam-consensus.org/) with all avian repeat consensus sequences present in Repbase (mostly from chicken and zebra finch).
This custom library was used for annotation of the male collared flycatcher reference genome fAlb15 (in-house version of FicAlb1.5; Kawakami et al., 2014) and female collared flycatcher W chromosome sequence  using RepeatMasker (version 4.0.1; Smit, Hubley, & Green, 1996-2010parameters -a -xsmallgccalc). We then generated TE landscapes from the resultant .align files as described elsewhere (Suh, Churakov et al., 2015). The male reference individual was from Sweden (Gotland, Baltic Sea; B), and the female used for W chromosome assembly was from Hungary (H).

| TEV prediction in 200-flycatcher and pedigree resequencing data
Using .bam files from whole-genome resequencing data (Burri et al., 2015), we predicted TE insertions absent from the reference genome assembly (nonreference TEVs) from discordant read mapping ( Figure 1) against the fAlb15 reference using RetroSeq (version 1.41; Keane, Wong, & Adams, 2013). We chose RetroSeq over other available TEV detection programs (reviewed in Rishishwar, Mariño-Ram ırez, & Jordan, 2017) because it is, to our knowledge, the most widely used program and was among the best performing in human TEV benchmarking (Rishishwar et al., 2017) after MELT (Gardner et al., 2017) and Mobster (Thung et al., 2014). We employed a series of very strict filters to make sure that only the most confident TEV loci remained for downstream analyses. (1) We excluded TEV loci from resequenced reads that F I G U R E 1 Schematic overview of TEV discovery from resequencing data. (a) A situation where the reference genome contains an empty target site (white square; i.e., TEV absence) at a unique genomic locus, and a resequenced individual contains a TE insertion (red) flanked by a duplicated target site (i.e., TEV presence), (b) mapping of resequencing reads against the reference will yield properly mapped read pairs (blue) and discordant read pairs (orange) where one of the reads instead maps to a library of transposable elements consensus sequences [Colour figure can be viewed at wileyonlinelibrary.com] overlapped with other fAlb15 reference repeats including 200-bp flanks. The 200-bp flanks correspond to two times the read length, which we consider necessary to minimize the number of false TEV calls due to erroneous paired-end read mapping in a short-read assembly. This filter is even more conservative than other studies using RetroSeq (e.g., 100-bp flanks; Lammers, Gallus, Janke, & Nilsson, 2017). For this step, we conducted fAlb15 repeat-masking with the same input library as for the TE landscape analyses (i.e., flycatcher repeats + avian repeats from Repbase) but added a partially curated library of hooded crow repeats (Vijay et al., 2016) to mask as many potential repeat regions as possible; (2) We excluded TEV loci that overlapped with fAlb15 reference tandem repeats. We ran Tandem Repeats Finder (Benson et al., 2013) on a hard-masked version of fAlb15 (RepeatMasker annotation masked as "N" nucleotides) to maximize the masking of tandem repeats; (3) We excluded TEV loci that overlapped with "N" assembly gaps including 200-bp flanks (Lammers et al., 2017) and scaffold ends including 200-bp flanks, as well as scaffolds smaller than 1 kb; (4) We excluded TEV loci that overlapped with regions that remained unmapped in SAMtools mpileup (Li et al., 2009) when mapping resequencing data from the reference individual against the fAlb15 reference assembly; (5) Finally, we excluded TEV loci where none of the TEV calls passed all internal RetroSeq filters (i.e., call status FL = 8; FL < 8 indicates calls which failed a particular filter: "1-depth too high in region, 2-not enough reads in cluster, 3-not enough total flanking reads, 4-not enough inconsistently mapped reads, 5-neither side passes ratio test, 6-one side passes ratio test, 7-distance too large at breakpoint"; Keane et al., 2013). The resultant filtered VCF files contained 11,888 TEV loci in the 200-flycatcher data (Data S5) and 713 TEV loci in the pedigree data (Data S6). The genomic locations of these TEVs were annotated using the collared flycatcher ENSEMBL annotation as intergenic regions, introns, coding sequences and untranslated regions.

| Comparison of TEV vs. SNP heterozygosities and derived allele frequencies
Given that coverage in the resequencing data ranges from 5 to 279 among the 200 flycatcher genomes (Burri et al., 2015), we considered the coverage too low to allow for reliable homozygous/heterozygous TEV genotype calls in RetroSeq. TEVs were therefore treated as dominant markers, that is, markers where homozygotes and heterozygotes cannot be distinguished, as classically has been the case for data with presence/absence alleles such as amplified fragment length polymorphisms (AFLPs; Excoffier & Heckel, 2006). As there is limited evidence for inbreeding in collared flycatchers (Dutoit et al., 2017), we imputed heterozygosity levels assuming Hardy-Weinberg equilibrium by letting the frequency of absence calls be q 2 in p 2 + 2pq + q 2 = 1. Additionally, we imputed heterozygosities using the software AFLP-SURV (Vekemans, Beauwens, Lemaire, & Rold an-Ruiz, 2002), which is based on a Bayesian method with a nonuniform prior distribution (Zhivotovsky, 1999), under Hardy-Weinberg equilibrium or observed levels of

| In-depth TE annotation of the collared flycatcher genome
We conducted an in-depth annotation of repetitive elements in the collared flycatcher reference genome assembly fAlb15 by combining automated de-novo prediction of repetitive elements in RepeatModeler, and manual curation and classification of repeat consensus sequences. The manual curation was particularly important for identifying LTR retrotransposons (or their solo-LTRs) because RepeatModeler automatically classified most of these as "unknown" TEs.
Following the general classification systems for TEs (Kapitonov & Jurka, 2008;Wicker et al., 2007), we defined our manually curated TE consensus sequences as TE subfamilies (Table S1). If these subfamilies exhibited sequence similarity to known TEs in CENSOR or to other subfamilies from our annotation, we considered these to belong to the same TE family (Table S1). TE subfamilies or groups of similar TE subfamilies were then defined as novel TE families (prefix "fAlb"; Table S1) if they had no significant sequence similarity to known TE families in CENSOR. We identified a total of six satellite repeat subfamilies (some of these LTR-derived) and 85 TE subfamilies, the latter grouping into five CR1 long interspersed element (LINE) families (see Figure S1 for their phylogenetic relationships), nine endogenous retrovirus 1 (ERV1) long-terminal repeat (LTR) families, 19 ERVK LTR families and 11 ERVL families (Table S1). Among these, five ERV1 LTR families, 14 ERVK LTR families and four ERVL LTR families (i.e., 23 in total) represented previously unknown families ( Figure S2).
In total, we identified 193,611 TEs or TE fragments in the collared flycatcher genome assembly using RepeatMasker (Table 1), representing 5.5% of the genome (Table 1). Also including all forms of tandem repeats, the total repeat content of the assembly was 11.7%.

| Timescale of TE activity in the collared flycatcher genome
To estimate the relative timescales of activity of different repeat families/subfamilies, we analysed the distribution of pairwise Kimura 2-parameter distance between TEs and their respective consensus sequence, also called TE landscapes (Figure 2). For autosomal TEs, we dated their approximate age of insertion using the estimated mutation rate of collared flycatcher of 2.3 9 10 À9 mutations per site per year (Smeds, Qvarnstrom, & Ellegren, 2016). The dated TE landscape suggested that the vast majority of detectable TE sequences are ancient, with most TE accumulation (especially CR1 retrotransposons) occurring between 33 and 54 million years ago (MYA). Note that there is an upper limit to how far back in time TE activity can be reliably detected, as at some point mutations will have accumulated such that sequences are beyond recognition as TEs. However, the distribution of pairwise distances also suggested that there has been TE activity all the way until in the very recent past, indicating that TEs may segregate in contemporary populations. In fact, the extent of very recent TE activity is likely to be underestimated from the occurrence of young TEs in genome assemblies due to issues leading to collapsed or unassembled repeats.
Notably, over 90% of autosomal TE sequences presumably younger than 11 MY (Figure 2; that is, <5% distance to consensus) belonged to the group of LTR retrotransposons (mostly from the 23 novel LTR families; Figure S2). CR1 retrotransposons, the "typical" TEs of bird genomes (Hillier et al., 2004;Warren et al., 2010;Zhang et al., 2014), only made up <10% of the bp and underwent a decrease in accumulation towards present times; very few CR1 retrotransposons accumulated within the last 5 MY.
The TE landscape of the Z sex chromosome was overall similar in shape but with slightly higher TE densities. The female-specific W chromosome had very high densities of retrovirus-like LTR retrotransposons (Figure 2; cf. . The trend of relatively recent replacement of CR1 activity by LTR activity was also visible on the Z and W sex chromosomes (Figure 2, cf. . genomes that were absent in the collared flycatcher genome assembly ("nonreference TEVs"; see Figure 1), potentially representing TE presence/absence variation (TEV). Given the challenge in confidently distinguishing between homozygote and heterozygote TEV calls (see Materials and Methods), it should be noted that "presence" means "presence of one or two allelic copies." We refrained from analysing TEVs representing TE insertions present in the reference genome assembly ("reference TEVs") but absent in individuals from the resequencing data. This was mainly because short-read assemblies (such as collared flycatcher) typically contain many assembly gaps in TE sequences, especially for copies belonging to young TE families, which we assumed problematic for presence/absence mapping. Also, and more generally, demonstrating the presence of a sequence in resequencing data is more straightforward than proving its absence (cf. below, concerning a dependence of coverage on the ability to detect TE insertions).
After very stringent filtering of TEV calls predicted by RetroSeq (Keane et al., 2013), we found a total of 11,888 TEV loci where the T A B L E 1 RepeatMasker annotation of the male fAlb15 assembly and the female W chromosome assembly from collared flycatcher using a manually curated collared flycatcher repeat library merged with avian repeats from Repbase Repeat copy numbers and total basepairs were taken from the RepeatMasker .tbl file where copy numbers are estimated after merging repeat fragments derived from the same insertion event. Note that the collared flycatcher assembly is based on short reads, and repeat copy numbers may be overestimated for young TEs interrupted by assembly gaps.

SUH ET AL.
| 103 TE insertion was absent from the collared flycatcher assembly. Of these TEVs, 10,211 segregated within or among the four black-andwhite species (see further below; Figure S3, Data S5). Most TEVs belonged to eight different LTR families, notably from all three major groups of endogenous retroviruses ERV1, ERVK and ERVL (  Table S2).
Phylogenetic network analysis (Huson, 1998) (Nater et al., 2015). Furthermore, the distribution and extent of reticulations within the phylogenetic network were in agreement with the previously noted high prevalence of phylogenetic discordance due to incomplete lineage sorting at the base of this rapid four-species radiation (Nater et al., 2015). This demonstrates that TEVs, at least in these species, behave in a similar way to nucleotide sequence polymorphisms in terms of how they segregate within and between species. However, we note that as our analysis is based on nonreference TEVs, the amount of phylogenetically concordant TEVs may be underestimated for collared flycatchers.
For the species with population samples, we identified between 13 and 630 TEVs per individual. Importantly, both within each of these species and for all species taken together, there was a strong positive relationship between genomic sequence coverage and number of TEVs detected per individual (Figure 4). A near-linear increase in the number of TEVs was apparent across the whole range of sequence coverage, from 29 to 279 (Table S3).
Not surprisingly given the observed relationship between coverage and TEV detection, we found the largest number of TEVs in the population with the highest mean sequencing coverage (Atlas flycatcher, n = 20 individuals, 2,446 TEVs, mean coverage = 20.99;    Figure S5, Table 4). Mean heterozygosity per population was between 0.12 and 0.16, and the mean derived allele frequency between 0.09 and 0.15 (Table 4). We obtained similar heterozygosity estimates when using a Bayesian method with nonuniform prior distribution (Zhivotovsky, 1999) (Table S4) Table S4). When we limited diversity estimation to only include individuals with >159 genomic coverage (Table 4), estimates of heterozygosity as well as derived allele frequency were on average slightly higher than when using the full data set. This is not unexpected given the observed relationship between coverage and ability to detect TEVs. The allele frequency spectrum indicated that about half of all TEVs were private, that is, present in only a single individual ( Figure S5b). On top of the overall challenges in population genetic analysis of dominant markers, it is difficult to compare these diversity estimates with data from SNPs due to the very different procedures and protocols for genotyping. However, the results were largely in agreement with previous SNP data from the same flycatcher populations, which show similar polymorphism levels and allele frequency spectra (Burri et al., 2015;Dutoit et al., 2017).
We attempted to infer the timescales of retrotransposition events leading to TEVs within and among Ficedula flycatchers by parsimonybased mapping of TEVs on the phylogenetic tree of the sampled populations and species. This revealed frequent retrotransposition across initial (representing TEVs shared among two or more species), shallow (TEVs shared among two or more populations within species), as well as terminal (species-or population-specific TEVs) branches of the phylogeny (Figure 5b). Given the rapid divergence of these species, and limited differentiation among populations within species, sequencing F I G U R E 3 Emergence of novel LTR retrotransposon families (FA) and subfamilies (SF) in the three avian genomes with manually curated TE annotations. The divergence estimates are based on timetrees of major avian taxa (Jarvis et al., 2014) and passerines (Moyle et al., 2016). Each LTR family is defined as a group of similar LTR subfamilies with no sequence similarity to LTR subfamilies from a different LTR family.

| Verification of TEVs via a three-generation pedigree
We finally sought to validate TEVs and confirm stable Mendelian inheritance by tracing the segregation of TEVs across a three-generation pedigree of collared flycatchers (sequenced to 36-459 coverage; Smeds, Mugal et al., 2016) consisting of paternal and maternal grandparents (i.e., four individuals in the P generation), mother and father (two F 1 individuals) and five full-siblings (F 2 individuals; Figure S6a). We identified 713 nonreference TEVs in the pedigree (Data S6) and most of these belonged to the eight most common TEV families identified in the 200-genomes data set ( this is not unexpected if they represent P heterozygotes). Moreover, to directly be able to follow the inheritance of heterozygous loci, we focused on TEVs present in one P and one F 1 individual, as these markers should be very strong candidates for heterozygosity in the F 1 generation. We then followed their inheritance to the F 2 generation and estimated the binomial probabilities of inheriting TEV presence in zero to five F 2 individuals. The observed and expected distributions were very similar (Table S6). Overall, this suggests a low rate of false positives in the pipeline used for TEV detection and filtering in this study.
Furthermore, the pedigree information permitted us to estimate a minimum false-negative rate from those TEVs present in one or more individuals in a fashion that disagreed with the pedigree. We found 27 TEVs where a missing TEV presence call would explain the observed TEV call distribution across the pedigree, which would translate into a minimum rate of 3.8% false negatives (Table S5,    With in-depth repeat annotations from only two songbird lineages, it is of course difficult to generalize from these observations, but we hypothesize that songbirds in general are frequently exposed to germline infiltration by retroviruses. If so, the vast majority of songbird retrovirus diversity might yet await discovery.
This genome-scale study provides novel insights into how recent transposition activity in an avian lineage sets the stage for a rich source of structural variation within species as well as between closely related species. We consider the identification of 11,888 nonreference TEVs from 200 flycatcher genomes to be a conservative estimate for the amount of TE-derived structural variation in this sample, mainly because we only analysed nonreference TEVs. Reference TEVs likely constitute an additional source of TEVs in the sample but were not analysed due to methodological challenges.
Furthermore, we required the TEV loci to have >200-bp distance to assembly gaps and repeats in the collared flycatcher reference (thus excluding 369 Mb or 33% of the reference genome) and filtered TEVs for call quality more conservatively than originally proposed for the RetroSeq program (Keane et al., 2013). Finally, we note that the ability to detect nonreference TEVs in resequencing data was highly sensitive to coverage. Confidently scoring the full breadth of TEVs present in an individual is likely to require >309 of genomic coverage, at least with the present pipeline for scoring TEVs. This is consistent with recent observations from human TEV benchmarking (Rishishwar et al., 2017).
We independently verified TEV calls by tracing 713 TEVs across a three-generation pedigree of collared flycatchers. Only 3.8% of these TEVs were inconsistent with the pedigree (Figure S6), and 425 TEVs were traced across all three generations, suggesting that false negatives and false positives did not significantly contribute to the observed patterns of TEV abundance and diversity. This is further supported by inheritance patterns of putatively heterozygous TEVs from the F 1 generation to the five F 2 individuals (Table S6). Furthermore, population-pooled analysis of the 11,888 TEVs as phylogenetic markers ( Figure S4) recapitulated the phylogenetic relationships among the flycatcher species and populations (Burri et al., 2015), including the previously noted high degree of phylogenetic discordance in the deepest branching event between the four Ficedula species (Nater et al., 2015).
Importantly, our data highlight the potential importance of TEs in adaptation and speciation in birds. First, the high LTR retrotransposon activity during the last 20-30 MY in the lineage leading to flycatcher (as well as in the independent lineage leading to zebra finch) suggests that LTR insertions might have influenced the evolution of many traits. Second, the frequent occurrence of polymorphic TEs among and within contemporary flycatcher species may very well imply that TEVs have contributed to recent speciation events and currently contribute to fitness variation among individuals. These observations therefore suggest that genome scans aimed at identification of loci or genomic regions involved in speciation and adaptation need to integrate screening for TEVs in candidate regions.
Few thoroughly repeat-annotated genomes are available for birds. The only in-depth genome annotations that so far are comparable to our manually curated collared flycatcher repeat annotation are for chicken (Hillier et al., 2004) and zebra finch (Warren et al., 2010). The high diversity of newly discovered retrovirus-like LTR families in collared flycatcher, most of which are lineage-specific, suggests a high and yet largely unexplored diversity of songbirdinfecting retroviruses. We note that this unusual diversity of retroviruses coincides with the massive diversification of songbirds into thousands of species. Future research on population dynamics of retrovirus-like TEs will likely reveal the extent to which arms races with these genomic parasites impacted the population and species differentiation of songbirds. Additionally, our population-scale TEV data pave the way to elucidate how TEVs contributed to phenotypic variation through their manifold effects on transcriptional regulation, 3D genome folding and chromosomal stability.

DATA ACCESSIBILI TY
Repeat consensus sequences and VCF files of TEV calls are available as data files in supporting information.