Transposable elements in individual genotypes of Drosophila simulans

Abstract Transposable elements are abundant, dynamic components of the genome that affect organismal phenotypes and fitness. In Drosophila melanogaster, they have increased in abundance as the species spread out of Africa, and different populations differ in their transposable element content. However, very little is currently known about how transposable elements differ between individual genotypes, and how that relates to the population dynamics of transposable elements overall. The sister species of D. melanogaster, D. simulans, has also recently become cosmopolitan, and panels of inbred genotypes exist from cosmopolitan and African flies. Therefore, we can determine whether the differences in colonizing populations are repeated in D. simulans, what the dynamics of transposable elements are in individual genotypes, and how that compares to wild flies. After estimating copy number in cosmopolitan and African D. simulans, I find that transposable element load is higher in flies from cosmopolitan populations. In addition, transposable element load varies considerably between populations and between genotypes, but not overall between wild and inbred lines. Certain genotypes either contain active transposable elements or are more permissive of transposition and accumulate copies of particular transposable elements. Overall, it is important to quantify genotype-specific transposable element dynamics as well as population averages to understand the dynamics of transposable element accumulation over time.

Drosophila melanogaster recently evolved to be a human commensal and spread out of Africa to a worldwide distribution around 10,000 years ago (Baudry, 2004; Kauer, Zangerl, Dieringer, & Schlötterer, 2002; Sprengelmeyer et al., 2018; Wu et al., 1995; Yukilevich, Turner, Aoki, Nuzhdin, & True, 2010). When organisms colonize new habitats, conditions may be stressful and they may encounter new congeners. Both of these conditions could potentially result in an increase in transposable element activity, through introgression and reduced efficacy of the organism's system for repressing transposable element activity, such as piRNA (Engels, 1992; Kofler, Nolte, et al., 2015). D. melanogaster from Africa have been observed to have a lower number of transposable element insertions than cosmopolitan D. melanogaster, which has been attributed to a "waking up" of transposable elements upon colonization of new habitats. The sister species of D. melanogaster, D. simulans, also originated in Afrotropical climates and evolved into a human commensal with a cosmopolitan distribution throughout Europe and the Americas, albeit more recently (Sturtevant, 1920).
Due to its more recent spread, and heterogeneity among populations in their transposable element content, it was previously proposed that the waking up of transposable elements in D. simulans is currently in progress.
More recently, the most frequently used approach to studying transposable element abundance in D. melanogaster and D. simulans has been Pool-seq. Pool-seq has generated some interesting observations about transposable element dynamics; for example, in D. melanogaster it has confirmed that transposable elements are more abundant in cosmopolitan populations than in their ancestral African range (Kofler, Nolte, et al., 2015). Pool-seq documented the recent invasion of the P-element into D. simulans from D. melanogaster, highlighting the ever-changing transposable element landscape between species and populations (Kofler, Hill, Nolte, Betancourt, & Schlötterer, 2015a). While Pool-seq may be an effective tool for estimating population-level frequency, there is evidence that estimates of transposable element insertion dynamics can be confounded by differences in allele frequencies (Rahman et al., 2015). Furthermore, it is informative to estimate the variance between genotypes in transposable element copy number, in addition to population-level variation. For example, how much of the observed population-level variation is due to individuals with high copy number rather than low population averages?
In D. melanogaster, multiple sequenced inbred panels lend themselves to estimating copy number and insertion site frequency between individual genotypes. Active families of transposable elements appear to be largely shared between populations: in inbred strains of D. melanogaster from worldwide samples, the DGRP, and pooled noninbred flies from global samples, the majority of transposable element insertions are from the same six transposable element families (Rahman et al., 2015).
However, these estimates of specific differences in transposable element load between genotypes were performed on a limited number of strains and have not been performed in other systems, including in D. simulans.
Here, I will specifically address three questions in D. simulans, to understand which observations from D. melanogaster are unique to the species and which are shared. First, how do transposon families differ between fly genotypes and which transposon families are most prevalent in these differences? Second, how do transposable elements differ between cosmopolitan and ancestral D. simulans? Third, how much difference do we see between D. simulans sequenced from inbred lines versus those sequenced directly from wild collections? I estimate variance in transposable element copy number between inbred genotypes, differences between wild and inbred lines, and differences between the populations in the mean and variance of transposable element copy number.

… Jackson, Campos, Haddrill, Charlesworth, & Zeng, 2017). They were inbred in the laboratory for nine generations. During the process of inbreeding, five were lost and were sequenced from the original wild sample, which had been preserved in EtOH (Table 1). These five lines will be used as an estimate of "wild" Drosophila transposable element load, compared to the inbred lines. The raw reads are 90-bp paired-end Illumina sequencing, and they were downloaded from SRA PRJEB7673 (Jackson et al., 2017). The first read from each pair was used for mapping. The 169 California lines were collected from the Zuma Organic Orchard in Los Angeles, CA, on two consecutive weekends of February 2012 (Table 1; Signor, New, & Nuzhdin, 2017; Signor & Nuzhdin, 2018, 2019). Reads were single-end 100 bp, and this project has been deposited at the SRA under accession SRP075682.

| Mapping and copy number estimation
Example scripts for all of the following methods are available at https://github.com/signor-molevol/simulans_transposable. Reads were mapped using BWA-MEM version 0.7.15 to the D. simulans 2.02 assembly and the 179 consensus transposable element sequences from EMBL, downloaded from Flybase.org (Figure 1; Li, 2015, reference also available at https://github.com/signor-molevol/simulans_transposable). Of these, 128 were used for the analysis, removing those from non-D. melanogaster species that did not have a presence in D. simulans. Bam files were sorted and indexed with SAMtools v.1.9, and optical duplicates were removed using picard MarkDuplicates (http://picard.sourceforge.net) (Li et al., 2009; McKenna et al., 2010). Reads with a mapping quality below 15 were removed (this removes reads which map equally well to more than one location). Using read coverage to determine copy number has been compared to other methods and is neither permissive nor conservative (Srivastav & Kelleher, 2017).
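The mapping-quality filter described above can be illustrated with a minimal Python sketch. This is not the pipeline used here, which relied on SAMtools and picard; the function name `filter_by_mapq` and the toy records are hypothetical. The sketch assumes plain-text SAM input, in which the fifth tab-separated column holds the mapping quality.

```python
def filter_by_mapq(sam_lines, min_mapq=15):
    """Keep SAM header lines and alignments with MAPQ >= min_mapq."""
    kept = []
    for line in sam_lines:
        if line.startswith("@"):  # header lines carry no mapping quality
            kept.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        mapq = int(fields[4])  # MAPQ is the fifth mandatory SAM column
        if mapq >= min_mapq:
            kept.append(line)
    return kept


# Hypothetical toy records, for illustration only (not real data).
toy_sam = [
    "@SQ\tSN:2L\tLN:1000",
    "read1\t0\t2L\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "read2\t0\t2L\t200\t0\t4M\t*\t0\t0\tACGT\tIIII",
    "read3\t0\t2L\t300\t15\t4M\t*\t0\t0\tACGT\tIIII",
]
filtered = filter_by_mapq(toy_sam)
kept_names = [rec.split("\t")[0] for rec in filtered if not rec.startswith("@")]
```

In this toy input, the read with mapping quality 0 is dropped, while the reads at or above the threshold of 15 are retained along with the header.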
Transposable element copy number was estimated per family by estimating the average counts of reads mapping to the transposable element sequences and the genome with bedtools counts (Hill, Schlötterer, & Betancourt, 2015; Quinlan & Hall, 2010). Copy number of each transposable element was then normalized in R using the average counts from chromosome arm 2L. Significance of the difference between populations was determined using a t test for means and an F test for variances. p-values of comparisons between means and variances were corrected for multiple testing using Bonferroni correction.
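The normalization step can be sketched as follows. This is a simplified Python illustration rather than the R code used for the analysis; the function names, the element labels `TE_A` and `TE_B`, and all depth values are hypothetical. It assumes that copy number is taken as the average count over a transposable element consensus divided by the average count over 2L, and that the Bonferroni threshold is the nominal alpha divided by the number of elements tested.

```python
def estimate_copy_number(te_mean_count, reference_mean_count):
    """Copy number as TE coverage relative to single-copy reference coverage."""
    if reference_mean_count <= 0:
        raise ValueError("reference coverage must be positive")
    return te_mean_count / reference_mean_count


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


# Hypothetical average counts for one genotype (illustrative values only).
mean_count_2L = 30.0
te_mean_counts = {"TE_A": 3000.0, "TE_B": 150.0}

copy_numbers = {
    name: estimate_copy_number(count, mean_count_2L)
    for name, count in te_mean_counts.items()
}

# 128 elements were tested, so each comparison is judged against 0.05 / 128.
threshold = bonferroni_threshold(0.05, 128)
```

With these made-up inputs, an element covered at 100 times the 2L average is estimated at 100 copies, and the per-test threshold is roughly 0.00039.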

| SNPs and summary statistics
I called SNPs within the consensus sequence of the transposable elements and the genomes using GATK HaplotypeCaller (McKenna et al., 2010). SNPs were filtered for a minimum depth of four. SNPs were not filtered for missing calls given that not all individuals will share insertions. Tajima's D was estimated in windows of 1 kb using VCFtools, and prior to estimation indels and SNPs with more than two alleles were removed (Danecek et al., 2011). The site frequency spectrum of SNPs was estimated with VCFtools as the frequency of each SNP in the population, and then the frequency of the SNP frequencies was estimated in R (Danecek et al., 2011).
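The "frequency of the SNP frequencies" can be illustrated with a short Python sketch. The analysis itself was done with VCFtools and R; the function name, the choice of ten equal-width bins, and the example frequencies below are assumptions made purely for illustration.

```python
def site_frequency_spectrum(snp_frequencies, n_bins=10):
    """Count SNPs falling in each of n_bins equal-width frequency bins."""
    counts = [0] * n_bins
    for freq in snp_frequencies:
        if not 0.0 < freq <= 1.0:
            raise ValueError("SNP frequencies must lie in (0, 1]")
        # A frequency of exactly 1.0 is placed in the last bin.
        index = min(int(freq * n_bins), n_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical per-SNP frequencies (illustrative values only).
example_frequencies = [0.05, 0.05, 0.12, 0.55, 0.95]
spectrum = site_frequency_spectrum(example_frequencies)
```

A spectrum dominated by the lowest bins would correspond to an excess of low-frequency SNPs, the pattern interpreted in this study as a sign of recent activity.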

| Population-level variation
Of the 128 elements examined in the population, 85 have different mean numbers of insertions between the two populations (t test, Bonferroni-corrected p = .05/128, Table 2, Table S1). Of those, only 17 are higher in the African populations, suggesting that overall the CA population has more transposable element insertion sites. Indeed, overall Californian D. simulans have an average of 1,797 insertions per genotype, while African D. simulans have 1,496 (Table 2). The five elements with the largest difference in copy number in Californian D. simulans compared to African are INE-1, Tc1, transib2, 1360, and Cr1a (Table 2). These are present on average in 37 more copies in …

The most abundant transposable elements in each population tend to be abundant in both populations, namely INE-1, Cr1a, and G6. The D. melanogaster pogo and Helitron elements were not present in these populations, which has been previously noted, suggesting that these transposable elements are not present in D. simulans (Kofler, Nolte, et al., 2015). Previous work using Pool-seq in D. simulans identified INE-1, roo, Cr1a, Rt1c, and hobo as the most abundant transposable elements in D. simulans, and G6 was among the less abundant elements (Kofler, Nolte, et al., 2015).
Some elements are not present in full-length copies within either population. Six transposable elements (Stalker4, Stalker, Bari2, Tc3, G7, and Tart-C) were never present as more than a fraction of an element in any individual, and they are likely old and degraded. G3 and hopper2 are estimated as being present in ~1 copy per individual in both populations; however, that copy or those copies have internal deletions. For the G-element, all but a small fraction of reads map to one 140 bp sequence. A full-length version of Quasimodo (two copies) and gypsy6 (one copy) were present in one genotype, while in other genotypes Quasimodo appears to be old and degraded. Stalker3 is also present in one genotype as a full-length copy; however, in this case, old or degraded copies are not present in the other genotypes. Reads which map equally well to more than one location were filtered out; thus, this does not represent nonspecific mapping to repetitive elements. This is consistent with other work on Juan, which suggests it is actively transposing in the species (Kofler, Nolte, et al., 2015). The larger number of transposable elements with low-frequency SNPs in African populations may be due to the overall difference in the site frequency spectrum between populations (Figure S1; Signor et al., 2017).

| The P-element
The P-element recently invaded D. simulans from D. melanogaster as described in Kofler, Hill, et al. (2015); however, Pool-seq cannot tie P-element insertions to specific individuals and can only determine the average number of insertions. Previously reported averages were 0.4 insertions in Florida populations and 29 in South Africa (Kofler, Hill, et al., 2015). In the California population the average is two insertions; however, this is because the majority of individuals do not have any insertions (137 individuals have fewer than 0.3 estimated copies, Figure 2). The remaining individuals have between 0.5 and 39 copies. Interestingly, the P-element is not invading all genotypes in the population at the same rate, but rather is reaching high copy number in some genotypes and not others (Nuzhdin, 2000). It is possible that P-elements are proliferating only in strains that contained an active copy prior to collection (Nuzhdin, Pasyukova, & Mackay, 1997). This was observed previously in laboratory strains of D. melanogaster, though contamination and introgression may also have played a role (Rahman et al., 2015).

| Transposable elements in individual genotypes
Some transposable elements have considerably higher copy number in particular genotypes compared to the population average. For example, in one genotype Dsim\ninja is present in 29 copies, compared to the population mean of three (Figure 3). Dsim\ninja has 10 fixed differences and 27 polymorphisms in this strain from the California population, and the population average is 7.5 fixed differences and 264 polymorphisms. This suggests that Dsim\ninja was recently active in this genotype. This is true of several transposable elements which have outliers in the population. Stalker2 has an outlier genotype with 17 fixed SNPs and eight polymorphisms, compared to a population average of 14 fixed SNPs and 43 polymorphisms. Other transposable elements with large outliers in the California population include gypsy10, opus, blood, GATE, diver, Tabor, INE-1, diver2, idefix, 1731, 412, and 297. Sampling of the African populations was much more limited; thus, less genotype-specific variation is sampled, and indeed, only two transposable elements had large outliers, both in the same genotype from Madagascar: copia and diver. This genotype had 20 copies of copia, compared to 4-11 copies in the rest of the population, as well as 11 fixed differences and 20 polymorphisms (compared to a population average of 10 fixed differences and 78 polymorphisms). For diver, this genotype had 20 fixed differences and 84 polymorphisms, compared to a population average of 12 fixed differences and 215 polymorphisms (and 30 copies compared to 4-10 for the rest of the population).

| Wild versus inbred strains of D. simulans
The outlier genotype from Africa that has more copies of copia and diver is one that was inbred in the laboratory. In general, however, being inbred in the laboratory does not affect overall transposable element copy number: when comparing lines that were sequenced directly upon collection with those that were inbred, there is no significant difference in the mean number of transposable elements for any transposable element family. The activity of copia and diver is specific to a genotype, rather than to "wild" or "inbred" strains. The "waking up" of elements in individual lines appears to be due to sampling of individuals that are permissive or contain active transposable elements, rather than an overall increase in transposable element activity in inbred lines.

| Comparison to other studies
Tirant has previously been reported as having higher copy number in African D. simulans, potentially due to a recent mobilization of the element (Fablet, McDonald, Biémont, & Vieira, 2006). We find … in California (Fablet et al., 2006). The Dmau\mariner element has a higher copy number in Africa than in the Californian D. simulans, from 0-5 with an average of 2.33, compared to 0-3 with an average of 1.22 (Figure 3). Dmau\mariner also contains no polymorphisms, … (Kofler, Nolte, et al., 2015). In addition, in the populations reported here the G6 element has primarily low-frequency polymorphisms (Table 3), suggesting that this is a recent expansion of copy number. Overall, our estimates are higher than those of Kofler, Hill, et al. (2015), which focuses on euchromatic insertions and only estimates more than one insertion per line for four transposable elements (1360, hobo, roo, …). 1360, jockey, hobo, roo, … have been estimated as the most abundant transposable elements in D. melanogaster (Rahman et al., 2015; Kofler, Nolte, et al., 2015). … suggesting that colonization is associated with increased transposable element activity. However, overall the lack of reporting of individual population values makes comparison difficult.

TABLE 1 A list of the strains used for this study, including their collection location and inbreeding status

| Comparison to D. melanogaster
In D. simulans, there is some evidence that Dsim\ninja, Dmau\mariner, P-element, gypsy10, opus, blood, GATE, diver, Tabor, INE-1, diver2, Idefix, 1731, 412, 297, G6, flea, Bari1, Transpac, accord, and Juan are active, in the form of either genotypes with large increases in copy number or a site frequency spectrum biased toward low-frequency alleles. gypsy10, blood, Juan, G6, Tabor, Transpac, accord, and diver have been previously reported as undergoing a burst of activity in D. simulans and in D. melanogaster, likely due to recent invasion (Kofler, Nolte, et al., 2015). Flea, Idefix, 412, and 297 are also thought to be active, though due to an older invasion in the genome of D. melanogaster (Kofler, Nolte, et al., 2015). G6 has been reported as having low copy number in D. melanogaster; however, it was also potentially recently active.
Thus, D. melanogaster and D. simulans share many active families of transposable elements and appear to be experiencing an increase in transposable element copy number concurrent with worldwide expansion.

| CONCLUSIONS
Drosophila simulans is currently being invaded by transposable elements, and this spread is likely occurring concordant with the worldwide colonization of D. simulans, as has been posited by previous studies (Lachaise et al., 1988; Vieira et al., 1999; Biémont et al., 2003). African populations have their own transposable element dynamics, with some transposable elements seeming to share activity between populations (G6) and others being more active in African D. simulans (baggins, Bari1, etc.). It would be interesting to explore transposable element dynamics in other populations of D. simulans to understand the generality of the patterns seen here. Transposable element load is an attribute of species, populations, and individual genotypes. In inbred laboratory genotypes, active transposable element copies may be inherited by some genotypes and not others, and active transposable elements can accumulate over time (Nuzhdin et al., 1997). This can cause differences over time in the number of insertions within a genotype and large differences between genotypes in transposable element copy number (Nuzhdin et al., 1997). This may also be reflective of natural patterns in which transposable elements proliferate in particular genotypes rather than at low levels in the population as a whole (Nuzhdin, 2000).
Overall, looking at variance between individuals is an important part of understanding the ways in which transposable elements maintain themselves in populations.

FIGURE 3 (a,b) Estimated copy number for the two non-Drosophila melanogaster transposable elements included here, Dsim\ninja and Dmau\mariner. (c) The site frequency spectrum of Dsim\ninja for Californian and African D. simulans. The site frequency spectrum of Dsim\ninja is broad, suggesting that outside of the genotype with an active copy of Dsim\ninja this element has been diverging within this species for some time. In contrast, no SNPs were called in Dmau\mariner, suggesting recent colonization in D. simulans

ACKNOWLEDGMENTS
I would like to thank C. and S. Emery for helpful commentary on the manuscript. I am also thankful to J. Butler and T. Robert for help in the laboratory.

CONFLICT OF INTEREST
None declared.

AUTHOR CONTRIBUTIONS
S.S. conceived the study, performed the analysis, and wrote the paper.

DATA AVAILABILITY STATEMENT
All data are available at the Sequence Read Archive under SRP075682 and PRJEB7673.