Transposable element annotation in non‐model species: The benefits of species‐specific repeat libraries using semi‐automated EDTA and DeepTE de novo pipelines

Transposable elements (TEs) are significant genomic components which can be detected either through sequence homology against existing databases or de novo, with the latter potentially reducing the risk of underestimating TE abundance. Here, we describe the semi‐automated generation of a de novo TE library using the newly developed EDTA pipeline and DeepTE classifier in a non‐model teleost (Corydoras fulleri). Using both genomic and transcriptomic data, we assess this de novo pipeline's performance across four TE based metrics: (i) abundance, (ii) composition, (iii) fragmentation, and (iv) age distributions. We then compare the results to those found when using a curated teleost library (Danio rerio). We identify quantitative differences in these metrics and highlight how TE library choice can have major impacts on TE‐based estimates in non‐model species.


| INTRODUC TI ON
Transposable elements (TEs) are sequences of repetitive, non-coding DNA found in high abundance across the tree of life (Bourque et al., 2018;Wells & Feschotte, 2020;Wicker et al., 2007). Historically overlooked during genomic analysis and annotation, TEs are now recognised as key contributors to genome evolution and regulation, providing alternative promoters, neofunctionalisation, novel exons, and large-scale rearrangements (Bourque et al., 2018;Cowley and Oakey 2013;Hoen & Bureau, 2015). This realisation, coupled with the increased availability of genome sequences, has generated a growing need for both accessible and comprehensive TE annotation in non-model species.
TEs can be detected using either homology or de novo approaches. Homology-based approaches detect TEs through sequence comparisons against existing databases, whilst de novo approaches identify TEs through signatures such as structure or elevated copy number (Kennedy et al., 2011;Ou et al., 2019). Homology searches may lead to TE underestimates because sequence divergence can render certain TEs unrecognisable, which may be particularly common in non-model organisms where there may be large phylogenetic distances to their closest database entry (Bergman & Quesneville, 2007). Furthermore, due to their potential absence from databases, homology-based searches may bias detection away from species-specific TEs which have inserted since the common ancestor of focal species and library (Platt et al., 2016). This may be particularly true in the case of horizontally transferred TEs which are increasingly recognised to move between vertebrate genomes and may be important for long term TE persistence (Groth & Blumenstiel, 2017;Panaud, 2016;Zhang et al.,2020). Consequently, the generation of TE libraries that do not rely solely on homology-based searching is recommended (Hoen & Bureau, 2015;Platt et al., 2016).
However, de novo TE libraries also have disadvantages, as they may fail to detect low-copy number elements or erroneously identify/ classify TEs (Bergman & Quesneville, 2007). De novo TE libraries may therefore require a degree of manual curation, which can be both time consuming and labour intensive.
Several semi-automated pipelines for de novo library construction have been created to streamline their development, both in terms of annotation and classification. These include the Extensive de novo TE Annotator (EDTA) (Ou et al., 2019) and DeepTE (Yan et al., 2020). EDTA combines a suite of best-performing packages (LTR_FINDER, LTRharvest, LTR_retriever, Generic Repeat Finder, TIR Learner, HelitronScanner and RepeatMasker) to produce nonredundant TE libraries. EDTA also has an option to use RepeatModeler to do a final sweep for remaining unidentified TEs, thereby utilising two very powerful TE annotating tools (Ou et al., 2019). After initially performing well in rice (Oryza sativa, Ou et al., 2019), EDTA has subsequently been run across numerous nonvertebrate genomes, including sweet corn (Zea mays, Hu et al., 2021), field mustard (Brassica rapa, Cai et al., 2021) and sawfly (Euura lappo, Michell et al., 2021). DeepTE classifies TEs using machine learning, specifically by using convolutional neural networks to assign TEs to superfamily and order, with good performance in terms of accuracy and sensitivity against other similar classifiers as well in the assignment of previously unknown TEs (Yan et al., 2020). However, the impacts of EDTA's implementation on TE annotation in non-model, vertebrate genomes, particularly when combined with DeepTE, have yet to be fully explored.
In this study, we describe the use of both EDTA and DeepTE to construct a de novo library for Corydoras fulleri, a member of the Corydoradinae which are a species-rich subfamily of Neotropical catfishes with highly variable TE content (Alexandrou et al., 2011, Marburger et al., 2018. Teleost genomes contain the most abundant and diverse TE content of all vertebrates, including numerous horizontally integrated elements, making them interesting organisms to assess de novo pipelines (Sotero-Caio et al., 2017;Zhang et al., 2020). Here, we compare the performance of our Corydoras-specific TE library against the RepBase D. rerio library by estimating TE content within two Corydoras species; Corydoras fulleri and Corydoras maculifer. Specifically, our Corydoras-specific TE library was quantitatively assessed using estimates of four key TE-based metrics; (i) abundance, (ii) composition, (iii) fragmentation, (i.e., the likelihood that genomic TE copies have not been captured in a single contiguous manner during library creation), and (iv) sequence divergence distributions. We also use a mixture of both genomic and transcriptomic sequences to test how library type affects TE landscapes across different transposon age groups. Finally, we present this pipeline as a GitHub resource that will be applicable to a diverse range of species in the future ( Figure 1, https://github.com/ellen bell/FasTE).

| Extraction and sequencing of DNA and genome assembly
The genome of C. fulleri was assembled using both long-read PacBio sequencing and short-read Illumina Sequencing. Genomic DNA was extracted from C. fulleri using MagAttract HMW DNA Kit (Qiagen) for high molecular weight PacBio sequencing and PureLink Genomic DNA mini kit (by ThermoFisher Scientific) for Illumina Hiseq.

Sequencing was performed on two PacBio Sequel cells and one
Hiseq lane using 300 bp paired-end reads, which was estimated to generate 60x long-read PacBio coverage and 100x Illumina Hiseq coverage. All genomic library preparation and sequencing of C. fulleri was performed by Novogene Co Ltd.
The genome of C. maculifer was assembled using short-read Illumina based sequencing. Genomic DNA was extracted from C. maculifer using Q iagen DNeasy Blood and Tissue extraction kit.  (Leggett et al., 2014) and combined with contigs from paired-end assemblies using SOAPdenovo2 (Luo et al., 2012) under default settings but using a kmer size of 19 to produce scaffolds. Genome coverage for both assemblies was assessed using Quast (version 5.0.2, Gurevich et al., 2013) and completeness mea-  (Table S1).

| Extraction and sequencing of RNA and transcriptome assembly
The transcriptome of C. maculifer was assembled from short read Illumina based sequencing. RNA extraction (TRIzol Plus RNA Purification Kit) was conducted on somatic muscle tissue. The size selection and integrity of the extracted RNA was confirmed using an Agilent 2100 Bioanalyser (Agilent Technologies) which met internal QC standards of the sequencing provider. Transcriptomic library preparation and sequencing was performed by the Animal Biotechnology Laboratory of Esalq/Piracicaba and the cDNA library was then built using a TruSeq RNA Sample Prep kit (Illumina, Inc).
The C. maculifer cDNA was sequenced using paired-end sequencing on an Illumina HiSeq (as part of a larger multiplexed run), generating 10.09 million paired reads. The library was then demultiplexed and cleaned using Trimmomatic (version 0.2.36, Bolger et al., 2014) and subsequently assembled using the de novo transcriptome assembler Trinity (version 2.6.9, Grabherr et al., 2013). Transcriptome quality was later assessed using TransRate (version 1.03, Smith-Unna et al., 2016) (Table S2).

| Transposable element annotation
A de novo TE library was generated from the long-read PacBio C. fulleri genome using the Extensive de novo TE Annotator (EDTA) (Ou et al., 2019) set to the "others" species parameter. We utilised the inbuilt RepeatModeller (Smit and Hubley, 2008) support which identifies any remaining TEs which might have been overlooked by the EDTA algorithm (--sensitive 1). Classifications within this library were refined using DeepTE using the predefined metazoan model parameter setting (-m) (Yan et al., 2020). TE identification was performed using RepeatMasker (RM; version 1.332) utilising the NCBI/ RMBLAST (version 2.6.0+) search engine. This analysis was conducted either against the Danio rerio Repbase (26 October 2018) entry, which was also run through DeepTE (to allow for uniformity in F I G U R E 1 FasTE pipeline schematic. TE classification, referred to as the "D. rerio library" henceforth), or the Corydoras-specific library. RM was run under the most sensitive (-s) parameter setting in all instances. The genomic and transcriptomic RM output files were subsequently cleaned of nondistinct elements by removing overlapping repeats where a match with a higher likelihood score was available. Outputs were then parsed through a custom R script; RM_TRIPS (which is publically available at https://github.com/clbut ler/RM_TRIPS). RM_TRIPS was used to (i) remove repetitive elements not classed as TEs (e.g., microsatellites, simple repeats & sRNAs), (ii) merge elements found on the same contig if they had the same name, orientation, and their combined sequence length was less than or equal to the corresponding reference sequence in the repeat library, (iii) remove merged repeats with a length less than 80 base pairs, and (iv) for transcriptomic data, if multiple identical repeats were found across different transcript isoforms, only one was retained. This was to ensure that each repeat represented a unique genomic locus. This complete pipeline from de novo library generation through to RM output parsing has been consolidated into the annotated tool, FasTE, which is publically available at https://github.com/ellen bell/ FasTE (see Figure 1). Age distributions of TEs were compared across library types using their sequence divergence from library entry as a proxy. This made use of the RM outputs which reports the percentage of substitutions in a matching TE compared to its corresponding library hit.

| Comparative assessment of the performance of Corydoras-specific TE library
Age/sequence divergence distributions were generated for the four major TE classes -DNA transposons, LTR retrotransposons, SINEs and LINEs.
To investigate the potential origin of C. maculifer Mariner elements, we extracted every genomic copy with a matching length of >80% against its library hit, and every transcript copy where an element made up >80% of the transcript's length. We subsequently ran a BLASTn search against the RepeatMasker library, with elements potentially horizontally inherited if sequences had both (i) a best match (lowest E value) against a non-teleost species and (ii) following rationale used in (Rogers et al., 2018) a greater than 2% sequence similarity than its best teleost hit. Figures were produced using the ggplot2 package in R (Wickham, 2016).

| RE SULTS
To assess the impact of de novo library creation using EDTA/DeepTE pipelines we generated a de novo TE library (Corydoras-specific) from a long-read (PacBio) Corydoras fulleri genome assembly and benchmarked it against the D. rerio RepBase entry. TE content was then assessed across two Corydoras species and sequence types including: (i) a C. fulleri genome (ii) a C. maculifer genome (another species of the same lineage) and (iii) a C. maculifer transcriptome (Figure 2a).

| Use of the Corydoras-specific TE library led to a 2-3-fold increase in TE abundance estimates
Total TE abundance estimates were higher across both species and sequence types when using the Corydoras-specific library. For C.   Figure S1).

| Use of the Corydoras-specific TE library led to substantial changes in estimated TE composition
Using the Corydoras-specific library impacted TE composition estimates across both species and sequence types, which we assessed  (Figure 3(b) iii). To investigate whether any compositional bias had been introduced by DeepTE, we also ran the curated RepBase D. rerio TE library through DeepTE and compared its classification outputs against the original RepBase library. As expected, the RepBase curated library had a greater range of classifications than the DeepTE classified library ( Figure S2). Although general classification patterns were similar, there was some bias exhibited by DeepTE towards both TIR elements (hAT and Mariner-like) and LTR elements (BEL and Copia) ( Figure S2).

| The Corydoras-specific library was more fragmented when compared to a curated TE library
We assessed the degree of fragmentation in the Corydoras-specific library against the curated D. rerio library using (i) cumulative frequency of estimated individual TE abundances and (ii) TE length distributions across the C. fulleri genome (Figure 4). We define fragmentation as genomic TE copies which have not been captured as a single contiguous unit during library creation ( Figure 5). An excess of fragmented TE library entries will push a cumulative frequency curve further to the right because many entries will be found at low abundance within the genome (singletons) ( Figure 5). When standardised by total number of hits, we found little difference be-

| The Corydoras-specific TE library reduces average TE age estimates
We investigated the impact the Corydoras-specific library had on es-  (Gao et al., 2016).
Theoretically, an inverse relationship between homology-based identification rates and phylogenetic distance exists, in which sequence differences between the species used to develop a library and the target species may provide an obstacle for accurate TE detection. A previous comparison across 40 mammalian genomes demonstrated that TE detection rates exhibit a 'threshold limit', in which TE abundance underestimates are largely avoided until a phylogenetic distance greater than ~90MY is reached, above which homology-based searching may detect as few as 20% of total TEs (Platt et al., 2016). It is therefore no surprise that Corydoras TE content was probably underestimated when assessed using the D. rerio library given that these species are separated by ~150 million years of evolution (Chen et al., 2013).
In addition, estimated transcriptomic TE abundance was approximately an order of magnitude lower than genomic content, which probably reflects the fact that: (i) TEs may largely be located within non-coding regions of the genome, (ii) many TEs found within Corydoras genomes may be degraded and no longer possess the ability to be transposed, or (iii) epigenetic silencing mechanisms (such as CpG methylation and histone modifications) may prevent TE expression (Slotkin & Martienssen, 2007). It is also worth noting that this study used RNA-seq data originating from somatic muscle tissue. TE expression is likely to vary between different tissue types, theoretically evolving to be most active in the germline and comparatively silent in the soma (Haig, 2016).
TE composition estimates within the Corydoras were found to be similar to that of other teleosts, with DDE DNA transposons being the most abundant TE class, largely driven by a high abundance of Tc1-Mariner and hAT elements (Shao et al., 2019). Due to their "blurry promoters" Mariner elements appear to have a particular propensity for horizontal transfer across the vertebrate kingdom (Zhang et al., 2020). Despite the suggestion that homology-based methods may miss horizontally transferred elements, a BLASTn search against genomic C. maculifer Mariner elements (see methods for full details) demonstrated that the percentage that have a best hit against a non-teleost species differed very little between library type (5.58% for the Corydoras-specific and 4.22% for the D. rerio library). Interestingly, it appears that the percentage of expressed Mariner elements with a best hit against a non-teleost species within the C. maculifer transcriptome was much higher than within the genome (17.14%), suggesting potential horizontally transferred elements may be more likely to be under purifying selection and retain their transposition ability (Zhang et al., 2020). The evolutionary impacts of horizontally transferred TEs are potentially wide-reaching F I G U R E 5 Schematic depicting TE fragmentation as a result of de novo library creation. Fragmentation during de novo library creation occurs when single TE copies are detected as multiple fragmented copies. This creates an overinflation of unique library entries, and results in a skewed cumulative frequency curve due to an excess of singletons (TEs detected once only) TE6  TE1 TE2  TE3 TE4 TE5   TE1  TE2  TE3 2) De novo library creaƟon (e.g. EDTA)

TE Entries (%)
(see Schaack et al., 2010), and thus their accurate annotation is important. More conservative testing across a wider range of elements would be required to fully investigate the role that library type has on the detection of horizontally transferred TEs, and is an important avenue to explore in the future.
The use of the D. rerio TE library led to skewed estimates of TE compositions. In particular, homology-based searching inflated the relative proportion of genomic SINE elements to a level equivalent to DDE DNA transposons, which was unexpected given other teleost species contain particularly SINE-depleted genomes (Gao et al., 2016;Shao et al., 2019) and potentially caused by a predisposition for homology-based searching against the detection of certain TE classes (e.g., DNA transposons). Furthermore, the majority of SINEs found by homology searching were represented by a single SINE element, which was not present in the Corydorasspecific library, suggesting a failure to comprehensively detect SINE elements during de novo library creation. This finding was, in part, expected: SINE elements have a propensity to be missed during de novo library creation because of high sequence variation and lack of terminal repeats and supports a prediction made by Ou et al. (2019). Such compositional differences were not observed within transcriptomic sequences, possibly because expressed TEs (which are often younger) tend to have higher levels of sequence similarity (Lanciano & Cristofari, 2020). Over-reliance on homology-based searching may lead to similar inaccuracies during TE abundance and compositional estimates, particularly when working with organisms that are phylogenetically distant from a model organism in which a curated TE library exists.

The substantial increase in individual elements detected in the
Corydoras-specific library compared to the D. rerio library raises the possibility of false discovery and/or fragmentation, whereby libraries contain multiple fragmented entries representing different regions of a single contiguous element (Flynn et al., 2020). Both false discovery and library fragmentation are common pitfalls associated with de novo pipelines (Flynn et al., 2020;Ou et al., 2019,)  TRANSCRIPTOME when comparing fragmentation levels between de novo generated libraries produced by different pipelines (Flynn et al., 2020 (Brawand et al., 2014;Schartl et al., 2019). Furthermore, the variation in estimated sequence divergence associated with the Corydoras-specific library was larger than the D. rerio library, suggesting that the Corydoras-specific library was able to detect TEs of a greater age range. Taken together these findings further support the notion that the majority of TEs detected using the D. rerio library are probably those inherited from the common ancestor of the Corydoras and D. rerio. Additionally, we found that elements found within transcriptomic data had a greater probability of being less divergent than their genomic counterparts, further supporting the hypothesis that expressed TEs tend to be both younger and more intact, with a greater potential for active transposition.

ACK N OWLED G EM ENTS
We would like to thank Dr Alexander Suh and Professor Tracey Chapman for their valuable insights on the analysis and compilation of this manuscript and to Dr Riviane Garcez da Silva for her help in RNA extraction. The genomic assemblies presented in this paper were produced on the High Performance Computing Cluster supported by the Research and Specialist Computing Support service at the University of East Anglia.

CO N FLI C T O F I NTE R E S T
None.

AUTH O R CO NTR I B UTI O N S
The study was conceived and developed by Ellen A. Bell, Christopher