Novel quality metrics allow identifying and generating high‐quality assemblies of piRNA clusters

In most animals, it is thought that the proliferation of a transposable element (TE) is stopped when the TE jumps into a piRNA cluster. Despite this central importance, little is known about the composition and the evolutionary dynamics of piRNA clusters. This is largely because piRNA clusters are notoriously difficult to assemble as they are frequently composed of highly repetitive DNA. With long reads, we may finally be able to obtain reliable assemblies of piRNA clusters. Unfortunately, it is unclear how to generate and identify the best assemblies, as many assembly strategies exist and standard quality metrics are ignorant of TEs. To address these problems, we introduce several novel quality metrics that assess: (a) the fraction of completely assembled piRNA clusters, (b) the quality of the assembled clusters and (c) whether an assembly captures the overall TE landscape of an organisms (i.e. the abundance, the number of SNPs and internal deletions of all TE families). The requirements for computing these metrics vary, ranging from annotations of piRNA clusters to consensus sequences of TEs and genomic sequencing data. Using these novel metrics, we evaluate the effect of assembly algorithm, polishing, read length, coverage, residual polymorphisms and finally identify strategies that yield reliable assemblies of piRNA clusters. Based on an optimized approach, we provide assemblies for the two Drosophila melanogaster strains Canton‐S and Pi2. About 80% of known piRNA clusters were assembled in both strains. Finally, we demonstrate the generality of our approach by extending our metrics to humans and Arabidopsis thaliana.

to combat the spread of TEs (Brennecke et al., 2007;Lewis et al., 2018;Yang et al., 2017). In mammals and invertebrates, the host defence against TEs is based on piRNAs, that is small RNAs with a size between 23 and 29 nt (Brennecke et al., 2007;Gunawardane et al., 2007). These piRNAs bind to PIWI-clade proteins that silence TEs at the transcriptional as well as the post-transcriptional level (Brennecke et al., 2007;Gunawardane et al., 2007;Le Thomas et al., 2013;Sienski et al., 2012). piRNAs are derived from discrete genomic loci, termed piRNA clusters, which may make up substantial portions of genomes (e.g. 3.5% in D. melanogaster). piRNA clusters play a central role in the defence against TE invasions (Bergman et al., 2006;Zanni et al., 2013). Under the currently prevailing view, the trap model, it is assumed that a newly invading TE is stopped when a copy of the TE jumps into a piRNA cluster, which triggers the production of piRNAs that silence the TE (Bergman et al., 2006;Duc et al., 2019;Goriaux et al., 2014;Malone & Hannon, 2009;Ozata et al., 2019;Yamanaka et al., 2014;Zanni et al., 2013). Despite the central importance of piRNA clusters in the defence against TEs, the composition and evolution of these regions remains poorly understood. A better understanding of these regions could shed light on important open questions in TE biology, like whether or not the trap model holds (Kofler, 2019(Kofler, , 2020Mohamed et al., 2020). Our lack of knowledge comes mostly from the fact that piRNA clusters are notoriously difficult to assemble. Most piRNA clusters are located in the heterochromatin and consist of highly repetitive sequences such as TEs (Asif-Laidin et al., 2017;Brennecke et al., 2007;Zanni et al., 2013). Long-read sequencing (e.g. by Pacific Biosciences or Oxford Nanopore Technology) promises to close this gap in our understanding by enabling us to obtain complete assemblies of piRNA clusters. However, it is currently not clear which assembly strategies yield reliable assemblies of piRNA clusters, since many different assembly tools, polishing strategies, sequencing data and scaffolding approaches may be used. In fact, it is not even clear on how to identify the best assemblies, as classic quality metrics such as BUSCO and NG50 are ignorant of TEs and piRNA clusters: BUSCO (Benchmarking Universal Single-Copy Orthologs) provides the fraction of correctly assembled (i.e. 'complete') core genes (Simão et al., 2015;Waterhouse et al., 2018) and NG50 gives the size of the smallest contig out of the largest contigs that account for 50% of the reference genome (Earl et al., 2011).
Here, we address these challenges by first introducing several novel quality metrics that assess the number and the quality of the assembled piRNA clusters. Our novel quality metrics were then used to evaluate the effect of different assembly algorithms, polishing approaches, read lengths, coverages, levels of residual polymorphisms and scaffolding methods. Based on these results, we identify strategies that generate high-quality assemblies of piRNA clusters. Using such an optimized assembly strategy, we provide novel assemblies for the Drosophila melanogaster strains Canton-S and Pi2. Additionally, we demonstrate the generality of our approach by extending our metrics to humans and A. thaliana. We provide a user-friendly pipeline, a manual and a walkthrough for assessing the quality of assembled piRNA clusters.
The strain Iso-1 was solely sequenced using the Illumina paired-end technology.
High molecular weight DNA for Oxford Nanopore sequencing was extracted from whole bodies of 50 female virgin flies using the Phenol-Chloroform extraction protocol described by Maniatis et al. (1982) using slightly elongated incubation times (5 min). The DNA was sheared to a mean fragment length of 20-30 kb with Covaris g-TUBEs (Covaris Inc., Woburn, MA, USA). The length of the DNA was measured with a TapeStation (4200; DNA ScreenTape, Agilent Technologies). Library preparation was performed with an input of 2-5 μg of sheared DNA following the manufacturer's protocol (kit LSK108; Oxford Nanopore Technologies; Oxford). About 1-2 μg DNA for Illumina paired-end sequencing was extracted from whole bodies of 20-30 virgin female flies using a salt-extraction protocol (Maniatis et al., 1982). Libraries were prepared with the

| Assemblies
Short-read assemblies with Illumina paired-end reads (read length 125 and mean coverage of 30×) were performed with abyss (Simpson et al., 2009) (version 2.1.5; abyss-pe) using a k-mer size of 96.
We generated the assemblies with miNiasm using default settings.
The resulting assembly graph files (gfa) were transformed into fastafiles with awk. We launched wtdbg2 with the raw nanopore reads and a nanopore-specific preset ('preset2'). flye was launched with the raw nanopore reads with the corresponding option (--nano-raw) and default parameters.
Polishing of long-read assemblies was carried out in two steps.
We first used racoN (Vaser et al., 2017)   . The optimal number of polishing iterations was chosen based on the maximally achieved BUSCO (Benchmarking Universal Single-Copy Orthologs) values (Table S2).
An assembly graph was provided. Reference-guided scaffolding was performed with ragoo (Alonge et al., 2019) (version 1.1) based on release 6 of the D. melanogaster reference genome (Hoskins et al., 2015).
Random sampling of reads was performed with seqtk (https:// github.com/lh3/seqtk) (version 1.3-r106). To obtain subsets of the longest reads, we sorted all reads by length and then used the appropriate number of the first reads (i.e. the longest reads). Polishing of assemblies generated with subsets of reads was carried out with the respective subsets.
For an overview of our assembly pipeline, see Figure S1.
The final assemblies were based on caNu using 100× of the longest reads. Misassemblies were identified based on Hi-C heatmaps and alignments of the assemblies to the reference genome (dotplots).
Hi-C heatmaps were generated with juicer (Durand et al., 2016) (version 1.7.6) using 'Sau3AI' as the restriction enzyme. Heatmaps were visualized and analysed with juicebox (1.11.08). Potential misassemblies identified in the Hi-C heatmaps were cross-validated with long reads that were aligned to the assemblies. Breaks in the alignment of the long reads were interpreted as support of an assembly error.
Contigs with misassemblies were broken with a custom script 'in-troduceBreaks2fasta.py'. Potential contamination (e.g. adaptor sequences) was removed using the standard tools implemented by NCBI.

| Quality of assemblies
busco (Waterhouse et al., 2018) (version 3.0.2) values were based on the diptera _odb9 data set (2799 genes). quast (Gurevich et al., 2013) (version 5.0.2; quast-lg) was used to compute basic assembly statistics such as NG50 and the total assembly length. As reference, we used the genome of D. melanogaster (release 6).
Computing our TE landscape metrics (abundance of TEs, number of SNPs and internal deletions (IDs) within TEs) requires Illumina raw sequencing reads (expectations) and artificial reads generated from an assembly of interest (observations). We generated artificial reads of length 125 bp starting at each position of the assembly (yielding a uniform distribution; artificial-reads-for-assembly.py). The abundance of TEs, as well as the number of SNPs and internal deletions within TEs, was estimated with deviate (Weilguny & Kofler, 2019) (version 0.3.6) to obtain both the expected values (Illumina raw reads) and the observed values (artificial reads derived from the assembly). As reference library for deviate, we used the consensus sequences of TE families present in D. melanogaster (v9.42; we added the sequence of Mariner: M14653) (Quesneville et al., 2005). Solely SNPs and internal deletions with a minimum frequency of 2% were considered. The GC content for each TE was calculated via a custom script ('GC-content-calculator.py').
The CUSCO metric relies on the annotation of piRNA clusters of D. melanogaster release 5 (Brennecke et al., 2007;Hoskins et al., 2007). From the 142 annotated piRNA clusters, we excluded clusters that were annotated at the ends of chromosomes (10) and on the highly fragmented U-chromosome (46) (as flanking sequences can not be obtained for these clusters). For the remaining 86 clusters, we identified sequences flanking the clusters at both ends.
These flanking sequences were required to align uniquely to release 6 of the D. melanogaster reference genome. For two piRNA clusters that were adjacent to each other (cluster 8 and 9), we could only obtain a pair of sequences flanking both clusters. In summary, we were able to design flanking sequences for 85 piRNA clusters. These sequences had a size between 49 and 12,567 nucleotides. To compute the CUSCO, the flanking sequences were aligned to an assembly using bwa mem ). The CUSCO was computed with the script 'cusco.py' as the fraction of complete piRNA clusters (i.e. both flanking sequences aligned to the same contig/scaffold).
We furthermore distinguished between an ungapped-CUSCO and a gapped-CUSCO based on the presence of poly-N sequences between the two sequences flanking a piRNA cluster. Poly-N tracts in assemblies were identified using the script 'find-polyN.py'. To determine whether piRNA clusters are uniquely assembled, we tested if both flanks mapped to multiple contigs/scaffolds using the script 'multi-cluster.py'.
To identify assembly errors in piRNA clusters, we aligned long reads to assemblies using miNimap2 (Li, 2018) (version 2.16-r922). A list of complete BUSCO genes was obtained from the BUSCO pipeline (diptera_odb9; 2799 genes). Based on these data, we computed the base coverage and the soft-clip coverage along each piRNA cluster as well as the 99% quantiles of these coverages (quantiles.py). To calculate the base-coverage quality (CQ), we divided the standard deviation of base coverage in a cluster by the median of standard
The PCR products were loaded on a 1% agarose gel and ran with 120 V for 30 min in TBE buffer. The expected length of amplicons was inferred from the assemblies. Only polymorphic TE insertions for which both breakpoints agreed with the expectations were assumed to be successfully validated.

| Data analysis
To identify heterozygous SNPs, we aligned Illumina paired-end reads to release 6 of the D. melanogaster genome with bwa mem using default parameters. Reads with a low mapping quality were removed using samtools (version 1.7; ), a mpileup file was created (samtools), and allele frequency estimates were obtained using mpileup2sync (popoolatioN2;  with the parameters --fastq-type sanger --min-qual 20. The fraction of heterozygous SNPs for windows of 100 kb was computed with a custom script (polymorphicSNPs_from_sync.py). To account for sequencing errors, we solely classified SNPs with allele frequencies between 0.25 and 0.75 as segregating. Furthermore, a minimum coverage of 10 was required for each site. Windows with insufficient coverage at more than 25% of the sites were excluded. Finally, solely windows with sufficient coverage in all three samples (Pi2, Canton-S and Iso-1) were retained.
To identify the redundant contigs, we chopped assemblies into nonoverlapping fragments of 1 kb using a custom script (chopgenome.py) and aligned them to the release 6 of the D. melanogaster genome using bwa bwasw with default parameters (version 0.7.17-r1188; . Ambiguously mapped reads were filtered with samtools (-q 20), and a mpileup file was generated. The mean coverage for 100 kb windows was calculated using a custom script (coverage_from_pileup.py).
We used sNiffles (version 1.0.7; Sedlazeck, Rescheneder, et al., 2018) to identify structural variants (SVs). Such SVs may either be present or absent in the assembly (classified as deletion and insertion, respectively). We first mapped the long reads to assemblies using Ngmlr (v0.2.7; Sedlazeck, Rescheneder, et al., 2018) with the parameter -x ont (ONT data as input). SVs were identified with sNiffles using the parameters --report_seq (obtain the sequence of SVs) and --genotype (report allele frequency estimates of SVs). The resulting vcf-file was filtered for SVs with a minimum length of 1kb.
To obtain heterozygous SVs, we filtered for allele frequencies between 25% and 75%. To identify SVs caused by TEs, we aligned the sequences of SVs to the consensus sequences of TEs (Quesneville et al., 2005) using blastN (2.7.1+, Altschul et al., 1990). using the parameters: -no_is (skip checking for bacterial insertions), -nolow (skip masking low complexity regions) and D. melanogaster TE sequences (Quesneville et al., 2005) or Drosophila TE sequences (Bao et al., 2015). Synteny within piRNA clusters among the assemblies was identified with blastN (2.7.1+, Altschul et al., 1990). To avoid cluttering of the figure, we removed annotations of TEs smaller than 1 kb and blastN similarity blocks smaller than 3 kb. For D. melanogaster, we used assemblies from NCBI with following accession numbers: SIXD01000000 and SISJ02000000 (Ellison & Cao, 2020)

| Application in different species
To calculate the TE landscape metrics for humans, we used the repetitive sequences library for humans from RepBase (Bao et al., 2015) (version 23.10, humrep.ref) containing 1063 sequences. We compared a short-read and a long-read-based assembly derived from the same individual (KOREF) (Cho et al., 2016;Kim et al., 2019).
To obtain 'expected' values, short-read sequences of the KOREF individual were used (SRR2204705) (Cho et al., 2016). To obtain the 'observed' values, we created artificial reads for both the short-read (KOREF1.0 (Cho et al., 2016)) and the long-read assembly (KOREF PB_62x (Hi-C scaffolded) (Kim et al., 2019)). The landscape metrics were computed as described before.
To establish flanking sequences for piRNA clusters in humans, we used annotations of 168 piRNA clusters (Sarkar et al., 2014) in the human reference genome hg19. Flanking sequences were created using the scripts 'flankbeder.sh' and 'flankparser.sh'. For each piRNA cluster, the 5 kb regions flanking each cluster were first split into five regions of 1 kb each. Potential flanking sequences containing N's were removed, and the remaining sequences were aligned back to hg19 using bwa bwasw (version 0.7.17-r1188; ). The potential flanking sequences were required to align back to the origin (with a tolerance of 5 kb) with a minimum mapping quality (mq) of 5. For each piRNA cluster, the most proximal pair of flanking sequence meeting these criteria was retained (136 out of 168). To calculate the CUSCO values, the flanking sequences were mapped to the respective assemblies using bwa bwasw and CUSCO was calculated as described before.
We calculated CUSCO for 11 assemblies of humans

| Assembly quality of piRNA clusters
Here, we aim to identify strategies that enable us to generate highquality assemblies of piRNA clusters. Since commonly used assembly quality metrics, such as BUSCO and NG50 (Earl et al., 2011;Simão et al., 2015), are ignorant of TEs, we first developed several novel quality metrics. Our novel metrics assess whether assemblies (a) accurately reproduce the abundance and diversity of TEs (i.e. the TE landscape) of an organisms, (b) have complete piRNA clusters and (c) contain assembly errors within piRNA clusters.
To obtain a data set for demonstrating our novel metrics, we sequenced the D. melanogaster strain Canton-S with (a) the Oxford Nanopore long-read technology (coverage 150×, mean read length ≈ 7 kb), (b) Illumina paired-end sequencing (coverage 30×, read length 125 bp) and (c) Hi-C (coverage 530×, read length 125 bp) (Table S1).
With our first metrics, we tested whether an assembly accurately reproduces the TE content of an organism. A good representation of the TE composition is an important quality control of assemblies and a requirement for an accurate assembly of highly repetitive regions such as piRNA clusters. With these new metrics, we do however not estimate whether TE insertion sites are correct, as this would re-  Figure S2). For an assembly of interest, we compute the observed TE landscape by generating artificial reads using the assembly as template, which are then used with deviate to estimate abundance and diversity of TEs ( Figure 1a, Figure S3). To avoid biases and sampling noise, these artificial reads should be uniformly distributed across the assembly and have the same length as the Illumina raw reads used for inferring the expected TE landscape.
To summarize the representation of TEs across all TE families  Figure S4). This yields, in total, three novel quality metrics (slope of abundance, SNP count and ID count) that estimate how well an assembly captures the TE landscape.
High-quality assemblies that accurately reproduce the TE landscape will have regression slopes of ≈1.0 for each of the three features.
Assemblies that overestimate the TE abundance will have a slope >1.0 and assemblies that underestimate the TE abundance a slope <1.0. To illustrate the usage of these metrics, we generated two assemblies of Canton-S: (a) an assembly based on Illumina short reads (abyss; Simpson et al., 2009), and (b) an assembly based on ONT long reads (caNu; Koren et al., 2017) and several rounds of polishing using the long and the short reads (3× racoN and 3× piloN; (Vaser et al., 2017;Walker et al., 2014). For both short and long reads, the coverage was 30×. The short-read assembly poorly reproduced the TE landscape ( Figure 1b; Figure S4). The abundance of many families was underestimated, and the diversity of many TE families (SNPs and IDs) was overestimated (Figure 1b; Figure S4). By contrast, the long-read assembly captured the TE landscape much more accurately, with most slopes being close to the optimum (i.e. 1; Figure 1b; Figure S4). The high quality of long-read assemblies was also observed with different assembly algorithms ( Figure S5) and with unpolished assemblies ( Figure S4).
Next, we developed a novel metric that allows us to assess whether piRNA clusters are completely assembled. In essence, the

F I G U R E 1
With three novel metrics, we assess how well an assembly captures the TE landscape of an organism, that is the abundance of TEs as well as the number of SNPs and internal deletions (IDs) within TEs. (a) Our metrics are based on a comparison between the expected and the observed TE landscape. We illustrate these metrics by the example of the TE abundance. The expected TE abundance (exp.) is derived by aligning Illumina raw reads to consensus sequences of TEs and counting the fraction of reads mapping to each TE family. Different TE families are shown in red and blue. The observed TE abundance (obs.) is derived by generating artificial reads from assemblies of interest, aligning these reads to the consensus sequence of TEs and counting the reads. A high-quality assembly will capture the TE abundance more accurately (good) than a low-quality assembly, for example having several assembly gaps (bad). (b) To summarize these results across all TE families, we perform a regression between the expected and the observed TE abundance. The slope of the regression represents our novel quality metric for the TE abundance. Results are shown for a short-and long-read assembly of Canton-S (30× coverage for both). Each dot represents a distinct TE family, and the dashed line shows the optimal representation of the TE abundance. Note that the long-read assembly captures the TE abundance more accurately than the short-read assembly (despite expectations being based on short reads). Similarly to the TE abundance, the slope of regression can be computed for the number of SNPs and IDs found in TEs  (Brennecke et al., 2007). Flanking sequences close to piRNA clusters were preferred. These flanking sequences are then mapped to an assembly of interest. Here, we consider a piRNA cluster to  Figure S4). CUSCO values differed substantially between F I G U R E 2 Novel metrics for assessing the quality of assembled piRNA clusters. (a) The CUSCO value (Cluster BUSCO) estimates the percent of complete piRNA clusters in an assembly of interest. Unique sequences flanking piRNA clusters are mapped to an assembly, and the CUSCO value is computed as the percentage of clusters where both flanking sequences align to the same contig. Depending on whether or not poly-N sequences (i.e. assembly gaps) are tolerated between the flanking sequences, an ungapped-CUSCO (u.CUSCO) and a gapped-CUSCO (g.CUSCO) can be computed. (b) BUSCO and CUSCO values for different assemblies of Canton-S (30x coverage for short and long reads). Although long-and short-read assemblies have similar BUSCO values, CUSCO values differ substantially. (c, d) Assembly errors in complete piRNA clusters may be identified based on (c) base-coverage heterogeneity and (d) elevated numbers of soft-clipped reads. Long reads aligned to a correct and a wrong assembly of a piRNA cluster are shown black. Red indicates not-aligned regions of long reads (i.e. soft-clipped regions). A repeat sequence is shown in blue. Example of an assembled piRNA cluster having a high (e) and low (f) quality. The clusters are from the long-read assembly of Canton-S (30×). Dotted lines show the 99% quantiles of the base coverage and of the soft-clip coverage in BUSCO genes. As a rough summary of the assembly quality for individual piRNA clusters, we compute the CQ (coverage quality) and ScQ values (soft-clip quality)  the short-and long-read assemblies (Figure 2b). A mere 5.88% of piRNA clusters were complete with the short reads, while 60% were complete with long reads. As we did not perform scaffolding, only the ungapped-CUSCO was calculated (Figure 2b). By contrast, both assemblies show high BUSCO values, which illustrates that BUSCO is of limited use for estimating the suitability of assemblies for an analysis of piRNA clusters (Figure 2b).
However, even when both flanking sequences align to the same contig, a piRNA cluster may still be incorrectly assembled, for example if some internal sequences are missing in the assembly.
Therefore, we implemented two additional metrics that allow us to identify assembly errors in complete piRNA clusters (Figure 2c-f).
Long reads aligned to an assembly of interest provide two complementary pieces of information that may allow us to identify assem- However, some assembly errors, such as deleted or misplaced sequences, might not have noticeable effects on the base coverage.
These assembly problems are instead characterized by breaks in the assembly where sequences are joined in the assembly that are not joined in the genome of the organism. As a consequence, many reads spanning these breaks can only be partially aligned back to the assembly (Figure 2d). These reads are usually soft-clipped; that is, a terminal end of a read is either not aligning to any contig or aligning to an entirely different location. Soft-clipped reads can thus be used to identify assembly problems. Therefore, we propose to compute the will likely be the main application of these coverage-based metrics.
In summary, we developed novel quality metrics that enable us to estimate the assembly quality of piRNA clusters. First, the TE landscape metrics test whether an assembly accurately reproduces TE abundance We made the scripts for computing our novel quality metrics and for visualizing the quality along piRNA clusters publicly available https:// sourc eforge.net/proje cts/cusco quali ty/. We additionally provide the sequences flanking piRNA clusters, a manual and a walkthrough.

| Optimizing the assembly strategy
Next, we aimed to identify an assembly strategy that enables us to generate high-quality assemblies of the piRNA clusters of the D. melanogaster strain Canton-S. At first, we evaluated the performance of four different long-read assemblers, which rely on slightly different algorithms. miNiasm (Li, 2016) (Table S2).
To investigate the influence of coverage on assembly quality, we evaluated the performance of each assembler with several different coverages. Reads were randomly subsampled to coverages ranging from 20 to 150× (Figure 3). Note that a minimum coverage of 20× was required for caNu and wtdbg2. To assess the quality of the assemblies, we combined our novel quality metrics with classical metrics (NG50, BUSCO and assembly length) (Figure 3; Table S6). However, we noticed that BUSCO values are very similar among the evaluated coverages and assemblers, suggesting that BUSCO is of limited use for estimating the suitability of assemblies for TE research (Table S6).
When considering relevant metrics (TE landscape metrics, NG50, CUSCO, assembly length, CQ and ScQ), we found that the quality of the assembly depends on the coverage but not the assembler (ANOVA comparing linear models; model1: metric, coverage, assembler; model2: metric, coverage; model3: metric, assembler; model1 versus model2 p = .47; model1 versus model3 p = .036). When solely considering NG50 and CUSCO as metrics, the assembler (but not the coverage) had a significant influence on the assembly quality (ANOVA comparing linear models; model1 vs. model2 p = .0004; model1 vs. model3 p = .11). This indicates that the quality of assemblies depends on the assembler and the coverage. Interestingly, the best assemblies were not necessarily obtained when all reads were used ( Figure 3). For example, caNu and flye yielded the largest NG50 with a coverage of 100x (Canu 100x = 8.1 Mbp, Canu 150x = 3.6 Mbp, Flye 100x = 17.2 Mbp; Flye 150x = 10.5 Mbp) and miNiasm the best representation of TEs at a coverage of 50x (miniasm 50x = 1.01, mini-asm1 50x = 1.11). Based on our novel quality metrics (abundance, SNPs, IDs, CUSCO, CQ and ScQ), caNu and miNiasm outperformed wtdbg2 and flye at all evaluated coverages (Figure 3; Figure S6). At most coverages, caNu captured the TE abundance more accurately than miNiasm, flye and wtdbg2 (Figure 3). Assemblies generated with caNu mostly had the highest CUSCO values (Figure 3), where up to 80% of the piRNA clusters were contiguously assembled with coverages ranging from 100× to 150×. Furthermore, caNu generated the most reliable assemblies of piRNA clusters (average CQ and ScQ values; Figure S6). Although flye yielded the highest NG50 values, it also generated the shortest assemblies ( Figure 3). The caNu assemblies were the largest at most coverages and showed intermediate NG50 values (Figure 3). Overall, we conclude that caNu yielded the most contiguous (highest ungapped-CUSCO) and the most reliable (highest CQ and ScQ) assemblies of piRNA clusters (Figure 3). For the remainder of this manuscript, we thus relied on assemblies generated with caNu.
When reads are randomly sampled, large portions of the data will not be used for the assembly. These unused data may, however, still contain long reads that could be useful for improving the quality of assemblies, for example by bridging gaps between contigs. Thus, we asked if the assembly quality could be further enhanced by sampling the longest reads instead of a random subset. To test this, we sampled subsets of the longest reads with coverages ranging from 20× to 150× ( Figure S7). The mean read length of these subsets ranged from 25,051 bp with 20× coverage to 7146 bp with 150× coverage ( Figure S7a). Canu assemblies based on the longest reads usually have higher NG50 values than assemblies based on random reads ( Figure S7b). The largest NG50 values were obtained when a coverage of 100x was used ( Figure S7c). Interestingly, CUSCO values were consistently highest for assemblies generated with the longest reads ( Figure S7c), while the coverage had little influence on the quality of the assembled piRNA clusters (average CQ and ScQ; Figure S8). The three TE landscape metrics (abundance, SNPs, IDs) revealed little differences between assemblies generated with random reads and the longest reads ( Figure S9).
Finally, we were interested in whether CUSCO values could be further improved by using de-novo scaffolding with Hi-C data As scaffolds usually contain gaps of unknown size between the contigs (mostly indicated by 100 'N' characters), we calculated the gapped-CUSCO ( Figure S7d).
Despite a substantial increase in NG50 values (145-1033%; Figure S10), scaffolding with Hi-C data only moderately improved the CUSCO values (3.5-20%; Figure S7d). This improvement was most pronounced at low coverages, where CUSCO values were quite low before scaffolding. We note that the clusters scaffolded with Hi-C contained gaps, that is missing sequences, mostly of unknown size (see below). Thus, it is crucial to distinguish between gapped-and ungapped-CUSCO to assess the quality of an assembly. As expected, other quality metrics, such as BUSCO and the three TE landscape metrics, were not influenced by Hi-C-based scaffolding (Table S5).
In summary, we found that our novel metrics are useful for assessing the quality of assemblies. Depending on the choice of the investigated regions (number and complexity), CUSCO may be a sensitive metric that identifies quality differences among assemblies not found by other metrics. With long reads and an optimized assembly strategy, up to 81% of the piRNA clusters may be contiguously assembled in D. melanogaster. Especially assemblies based on caNu and a subset of the longest reads (100× coverage) had a high quality. Finally, we found that Hi-C data were of limited use for assembling piRNA clusters.

| Influence of segregating polymorphisms on assembly quality
Based on Canton-S, we showed that long reads enabled us to generate high-quality assemblies of piRNA clusters. However, Canton-S is highly isogenic, having few segregating polymorphisms F I G U R E 3 Influence of the assembly algorithm (caNu, miNiasm, wtdbg2, flye) and the coverage on the quality of assemblies. Results are shown for our novel TE-centred quality metrics (CUSCO, abundance, SNPs, IDs) as well as classic quality metrics (NG50, BUSCO). Dashed lines indicate optimal performance  Figure S11a). We were interested whether piRNA clusters may also be reliably assembled for a less isogenic strain. We relied on the D. melanogaster strain Pi2, which is frequently used in TE research, for example, to assess the extent of P-element (a DNA transposon)induced infertility in females (O'Hare et al., 1992;O'Hare & Rubin, 1983;Srivastav et al., 2019). Pi2 has substantial numbers of segregating SNPs on several chromosomes ( Figure S11a). We first generated a high-quality data set for Pi2: 199x ONT long reads (mean read length ≈ 8 kb), 40× of Illumina PE data and 260× coverage Hi-C data (Table S1). An assembly of Pi2 was generated with our previously established strategy: 100x of the longest ONT reads (mean read length 19,219 bp) were assembled with caNu, the assemblies were subject to multiple rounds of polishing, and Hi-C data were used for scaffolding (Table S5). Based on our novel quality metrics, the assemblies of Canton-S and Pi2 are mostly of similar quality (apart from CQ and ScQ values, which have slightly lower values in Pi2; Table S5).
We noticed that the Pi2 assembly is substantially larger than the Canton-S assembly (12.5% larger, Table S5). This difference in assembly size might be due to the polymorphisms segregating in Pi2, where the assembly algorithm could have generated several contigs (e.g. a contig for each homologous chromosome) for polymorphic regions (Pryszcz & Gabaldón, 2016). To test this hypothesis, we sliced assemblies into nonoverlapping fragments of 1kb, aligned them to the reference sequence and calculated the average coverage for 100 kb windows ( Figure S11a). Uniquely assembled regions will have a coverage of 1, whereas regions assembled multiple times will have a coverage >1. We observed many multipleassembled regions for the Pi2 assembly ( Figure S11b) that largely overlap with polymorphic regions ( Figure S11c). Pi2 had more multiple-assembled regions than Canton-S (paired Wilcoxon ranksum test, V = 370610, p ≤ .0001). Segregating polymorphism in Pi2 thus led to redundantly assembled contigs which likely account for the large assembly size of Pi2 (Table S5). Interestingly, we did not observe any redundant assemblies of piRNA clusters for Pi2 (we tested if both sequences flanking piRNA clusters map to multiple contigs). This absence of redundant clusters is likely due to the fact that the vast majority of the piRNA clusters lie in regions with few segregating polymorphisms in Pi2 ( Figure S12). This may however not necessarily hold for other strains.
Polymorphic regions are also problematic as it is unclear on how to deal with heterozygous TE insertions. We therefore searched for heterozygous (0.25 ≤ frequency ≤ 0.75) structural variants (SVs) in our assemblies, using sNiffles (Sedlazeck, Rescheneder, et al., 2018). In total, we identified 9 heterozygous indels with a minimum size of 1 kb in our Canton-S assembly and 108 in our Pi2 assembly ( Figure S11d). A blast search revealed that 66.67% and 84.26% of these SVs in Canton-S and Pi2, respectively, were due to TEs. Two of these heterozygous TE SVs were found in piRNA clusters of Pi2 and none in piRNA clusters of Canton-S. Due to these difficulties, we recommend to use highly isogenic strains for assembling piRNA clusters.

| Finalizing assemblies
To provide chromosome-scale assemblies of Pi2 and Canton-S to the community, we manually broke up misassemblies ( Figure S13) and performed reference-based scaffolding with ragoo . Reference-based scaffolding raised the gapped-CUSCO to 95.3 for Canton-S and to 97.7 for Pi2 but had little effect on other quality metrics (Table S5). An overview of the quality of the final assembly, including the quality at the different assembly steps, can be found in Table S5. The assemblies of Canton-S and Pi2 are available at NCBI (PRJNA618654).

| Composition of piRNA clusters
Next, we investigated the quality and composition of the assembled piRNA clusters in more detail. We compared piRNA clusters between our chromosome-scale assemblies of Canton-S and Pi2 to the reference genome. Assembly errors, but also presence/absence polymorphism of TE insertions in piRNA clusters, could lead to vast size differences of clusters among assemblies. Thus, we first inves- When we estimated the quality of the assembled piRNA clusters using CQ and ScQ, we observed considerable differences among the clusters in both assemblies (Figure 4b,c). As expected, piRNA clusters with assembly gaps have low CQ values (Figure 4b). By contrast, assembly gaps had little impact on the ScQ values (Figure 4c).
Investigating the base coverage and the soft-clip coverage along each position of some clusters with low and high ScQ values revealed potential assembly issues at some positions of clusters with a low ScQ but not in the clusters with a high ScQ ( Figure S14).

Taken together, this illustrates that both CQ and ScQ values help
to identify clusters with potential assembly issues. However, solely an analysis of the base coverage and the soft-clip coverage along clusters will provide detailed information about the abundance and position of potential assembly problems. For comparing the composition of piRNA clusters, it is therefore necessary to consider the annotations of the clusters as well as the quality along clusters. We illustrate this approach with 42AB, one of the largest contiguously assembled clusters in D. melanogaster ( Figure 5). We computed the base coverage and the soft-clip coverage for 42AB in Canton-S and Pi2 ( Figure 5). We also annotated TEs with repeatmasker (Smit et al., 2015) and identified sequence similarity between our assemblies and the reference genome with blast ( Figure 5; Altschul et al., 1990).
The base coverage and the soft-clip coverage of 42AB in Canton-S are mostly within the 99% quantiles of BUSCO genes, which suggests that this assembly is of high-quality ( Figure 5). However, the soft-clip coverage and to a lesser extent also the base coverage of 42AB in Pi2 is elevated at the end and in the central simple-repeat region, indicating potential assembly problems in these regions ( Figure 5). When searching for causes of these potential assembly problems with sNiffles, we found a P-element insertion with a frequency of 100% at a site of the elevated soft-clip coverage in Pi2 ( Figure 5), demonstrating the utility of our novel quality metrics. We did not find a cause for the elevated soft-clip coverage in the central regions of 42AB in Pi2. Most TE insertions are shared between the three strains, and large synteny blocks, frequently involving several TE insertions, can be found ( Figure 5). Nevertheless, we also found differences among the three strains ( Figure 5). Most notably, a 26kb region -involving the X-element, GATE, Max-element and rover -was duplicated in Pi2 ( Figure 5). Relative to the reference genome, we also found several TE presence/absence polymorphism in both strains (7 in Pi2 and 11 in Canton-S; Figure 5). Interestingly, most of these polymorphic TEs show little divergence from the consensus sequence (<1%; Figure 5), which suggests that these polymorphisms are due to recent TE insertions into 42AB. These polymorphisms are largely in regions with inconspicuous base coverage and soft-clip coverage, which suggests that they are not due to assembly mistakes. Apart from Chimpo, which was identified using RepBase (Bao et al., 2015), all TEs identified in the cluster 42AB were present in the consensus sequences of TEs in D. melanogaster (version 10.01; Quesneville et al., 2005).
Finally, we validated several of the polymorphic TE insertions in piRNA clusters with PCR. In the cluster 42AB, we confirmed 11 out of the 14 tested polymorphic TE insertions, including the missing P-element insertion and the large duplication in Pi2 (7 present in Pi2; 3 present in Canton-S; 7 present in Iso-1 of which three are shared with Pi2; Figure S15; Table S3). In other piRNA clusters, we confirmed 20 out of the 22 tested polymorphic TE insertions (12 present in Pi2; 10 present in Canton-S; Figure S15; Table S3).
We conclude that our assembly strategy yields contiguous sequences of many piRNA clusters. Furthermore, our novel quality metrics may be used to identify the location of potential assembly problems in piRNA clusters.

| Extending our approach to different species
To demonstrate the generality of our approach, we extended our metrics to different species. We first tested the TE landscape metrics with a short-and a long-read based assembly (observations) of the same human individual (Korean reference genome: KOREF1.0 (Cho et al., 2016) and PB_62x (Kim et al., 2019)). The expected TE abundance and diversity was derived from the short-read data (Cho et al., 2016). The TE landscape metrics are based on 1063 TE families.
We did not compute the ID metric as solely 39 TE families possessed IDs in the 'expected' data set. Similarly to Drosophila, the long-read assembly of humans captures the abundance and diversity of TEs better than the short-read assembly (abundance: long-read = 0.898, short-read = 0.824; SNPs: long-read = 1.051, short-read = 1.056; Figure S16).
To extend CUSCO to humans, we designed flanking sequences for 168 piRNA clusters (Sarkar et al., 2014) and obtained unique flanking sequences for 136 of them. We applied CUSCO to 11 F I G U R E 5 The sequence of the cluster 42AB in our assemblies compared to the reference genome. The TE annotation (yellow-red gradient indicates similarity to the consensus sequence of the TE) and sequence similarity to the reference genome (grey gradient indicates the degree of similarity) are shown. PCR validated presence/absence polymorphisms of TEs or SVs are marked with a '*'. A TE insertion missed in the assembly is shown in green. The base coverage and soft-clip coverage are shown for Canton-S (top) and for Pi2 (bottom). The 99% quantiles based on BUSCO genes are shown as dotted lines. Note that the soft-clip coverage and to a lesser extent the coverage is elevated at the site of the missing TE insertion and in the simple-repeat region, indicating possible assembly problems publicly available human assemblies, where five are different versions of the same Korean individual (Cho et al., 2016;Kim et al., 2019). Although BUSCO values were nearly identical among the assemblies the CUSCO values showed more variation, where especially the ungapped-CUSCO revealed marked differences among the assemblies (Kolmogorov-Smirnoff test; u In summary, we argue that our quality metrics can be readily extended to diverse species and that CUSCO in particular is a sensitive metric detecting differences in assembly quality that are not easily detected by classic metrics such as BUSCO.

| DISCUSS ION
Here, we showed that long-read sequencing technologies enable us to generate high-quality assemblies of piRNA clusters. With an optimized assembly strategy, more than 80% of the piRNA clusters in D. melanogaster may be assembled, which can be increased up to 98% with scaffolding approaches.

| Novel quality metrics
Since current metrics of assembly quality largely ignore TEs and piRNA clusters, we introduced several novel quality metrics.
With three metrics, we first estimate whether an assembly accurately captures the TE landscape (abundance, SNPs and IDs) of an organism. These metrics may be viewed as a general control, since it is unlikely that repeat rich regions, like piRNA clusters, have been accurately assembled when TEs are poorly represented in the assembly. Unfortunately, the real TE landscape is not known for any organism. However, we argue that Illumina raw reads may be used to derive a useful approximation of the expected TE landscape.
Assuming that reads are more or less randomly distributed over the genome, and that sequencing errors are largely random within reads, this assumption should mostly be valid. Sequencing errors can be largely eliminated from the analysis of the abundance of SNPs and IDs by using a minimum allele frequency (Kofler, Orozco-terWengel, et al., 2011). Here, we used a minimum allele frequency of 2% for SNPs and IDs. In case more stringent criteria are required, a higher threshold may be used. The coverage will fluctuate over the genome, which could affect estimates of TE abundance. Especially, the GCbias, where regions with a high GC content have an elevated coverage (Minoche et al., 2011), could lead to overestimating the expected abundance of TEs with a high GC content. However, since we sum the average coverage over many different insertions of a TE family, with insertion sites in diverse genomic backgrounds (with varying GC contents), the influence of the GC-bias and of stochastic coverage fluctuations should be minimized by our approach. Furthermore, since we rely on the slope between the expected and the observed TE abundance, which is based on many TE families with different GC contents, the influence of the GC bias should be further reduced. In agreement with this, we did not find any correlation between GC content and TE abundance in the raw reads ( Figure S17). Moreover, we solely found a small but nonsignificant difference in GC content among TEs that are well-represented in genomes compared to TEs that are not well-represented ( Figure S17). Finally, it is reassuring that an assembly based on long reads captures the expected TE landscape more accurately than an assembly based on the short reads, which have been used for estimating the expected TE landscape ( Figure 1).
Apart from sequencing biases, also biases occurring during data analysis, such as mapping and quantifying of reads may occur. Since we use the same pipeline for the raw reads (expectations) and the artificial reads derived from an assembly (observations), these biases should largely be eliminated.    Figure 3; Figure 6; Table S6).
It is important to distinguish between ungapped-and gapped-CUSCO values. Clusters containing gaps likely miss some sequences, including TE insertions, which prevents a comprehensive analysis of the composition of clusters. It is thus most important to maximize ungapped-CUSCO values. However, scaffolding algorithms, which introduce gaps between adjacent contigs, have been used to generate most publicly available assemblies ( Figure S18). In these assemblies, many piRNA clusters may contain gaps. To gain a complete picture of piRNA clusters in an assembly, we thus recommend evaluating both CUSCO values (our script computes both).
Identification of the sequences flanking piRNA clusters requires a reference genome. Hence, CUSCO can only be used with species with a reference genome and an annotation of piRNA clusters. But even for species with a reference genome, it will not be feasible to identify suitable flanking sequences for all piRNA clusters (e.g. clusters at terminal ends of contigs/chromosomes).
One limitation of CUSCO is that the sequences flanking piRNA clusters need to be identified for each species separately. However, once sequences flanking piRNA clusters are identified, CUSCO values can be readily computed for many different assemblies ( Figure 6; Figure S18). Although we primarily designed CUSCO for species with piRNA clusters, we showed that the CUSCO approach can be extended to any regions of interest such as KEE regions in A. thaliana ( Figure 6c).
Since CUSCO ignores the actual sequence within the piRNA clusters, complete clusters may yet contain assembly errors, for example if internal regions are missing in the assembly. Therefore, we suggested that the base-coverage heterogeneity and the soft-clip coverage are useful metrics to identify potential assembly problems in piRNA clusters (Figures 5 and 6b). To derive the null expectations for these two metrics, we relied on complete BUSCO genes.
Complete BUSCO genes are ideal for this task: first, BUSCO genes are conserved single copy genes, which makes them relatively easy to assemble, even with short reads and a low coverage (Figure 2b).
Second, complete BUSCO genes provide a high-confidence set of genes that contain no or few assembly errors (since the ORFs are mostly complete). Third, BUSCO values are usually computed as a standard metric to assess the quality of novel assemblies. The list of complete BUSCO genes is provided as an output of the BUSCO pipeline. Based on the base coverage and the soft-clip coverage, potential assembly errors in a cluster can be identified by coverage values transgressing the quantiles computed from the BUSCO genes ( Figure 5; Figure S14). To roughly summarize the assembly quality of each piRNA cluster with representative numbers, we introduced the

ScQ and CQ values.
Although the soft-clip coverage of many piRNA clusters approaches the soft-clip coverage of BUSCO genes, the base-coverage heterogeneity of piRNA clusters is always higher than of BUSCO genes, which explains why the CQ values are usually smaller than ScQ values and rarely approach optimal values (i.e. ≥1.0; Figure 4; Figure S19). Repetitive regions, such as found within piRNA clusters, that is an assembly with high CQ/ScQ values but a low BUSCO is likely of low quality. This emphasizes that our metrics should not be interpreted in isolation but rather be used in combination with classic metrics such as BUSCO and NG50. However, we think the main use of CQ and ScQ values is to identify clusters with potential assembly problems within a given assembly. Finding such outlier clusters is robust to varying numbers of BUSCO genes as the null expectation for computing CQ and ScQ is identical for all clusters within an assembly.
Our novel quality metrics may not only be used to compare the quality of available assemblies but may also serve as a guide during the assembly procedure, for example, to identify the most suitable assembly algorithm. Our metrics should thus help to generate and to identify assemblies having a high fraction of correctly assembled piRNA clusters (or other regions of interest). In unison with standard assembly metrics such as NG50, BUSCO and the total size of assemblies, our metrics should help to generate and identify assemblies with high contiguity and reliability.

| Assembly strategy
We showed that high-quality assemblies of piRNA clusters can be obtained if: (a) the sequenced strains are isogenic; (b) long reads are available; (c) suitable assemblers, such as caNu are used with an optimized coverage and read length; (d) assemblies are polished using short and long reads; and (e) a scaffolding approach is used.
Isogenic strains are necessary to avoid redundant contigs and error-prone assemblies of piRNA clusters ( Figure S11; Figure 5).
However, it is possible that future tools generate high-quality assemblies of nonisogenic strains. For example, phased assemblers, such as falcoNphase (Kronenberg et al., 2018), may yield a separate contig for each homologous chromosome. For these algorithms, segregating polymorphism could even be an advantage as polymorphisms may help to distinguish between homologous chromosomes. We also found that long reads allow us to generate high-quality assemblies of piRNA clusters. Assemblies generated by any of the two major long-read technologies, ONT and PacBio, have a high quality ( Figure S18).
We found that caNu yields high-quality assemblies of piRNA clusters and that the TE landscape is most accurately reproduced.
The high quality of assemblies generated by caNu was also noticed in several previous works (Jayakumar & Sakakibara, 2017;de Lannoy et al., 2017;Solares et al., 2018;Wick & Holt, 2019). The best assemblies were obtained when solely a subset of the long reads was used for an assembly with caNu, that is 100x coverage with the longest reads. We suspect that this may be related to an algorithmic assumptions about the corrected error rate, which is coverage-dependent and governs the overlap among reads (see Canu manual https:// canu.readt hedocs.io/en/lates t/param eter-refer ence.html).
Since long reads have a high error rate, polishing of assemblies using long or short reads is usually recommended (Rice & Green, 2018;Sedlazeck, Lee, et al., 2018). Interestingly, polishing also increased the fraction of contiguously assembled piRNA clusters as well as the representation of the TE abundance and diversity (Tables S4 and S5).
Scaffolding with Hi-C slightly increased the number of assembled piRNA clusters (using gapped-CUSCO) but had little influence on the representation of the TE landscape (Figure S7 Table S5).
Nevertheless, scaffolding approaches may still be useful for TE research, since scaffolding enables generating chromosome-sized sequences, which could be important when the genomic context of a TE insertion is relevant (e.g. whether a TE or piRNA cluster is close to a telomere).
Despite our optimized assembly strategy, about 19% of the piRNA clusters were not contiguously assembled (Table S5, after polishing). Additionally, manual curation of the final assemblies was necessary to avoid misassemblies ( Figure S13). This demonstrates that assembly strategies may still be improved. Especially, promising may be further advances in the length of reads (e.g. by improvements in library preparation protocols), their accuracy (e.g. long high-fidelity reads (Wenger et al., 2019)) and in algorithms generating phased assemblies, which could yield a separate contig for each homologous chromosome. Phased assembly algorithms may even allow us to use outbred strains. Furthermore, such phase assemblers avoid the central problem of assemblies of diploid organisms; that is, that two potentially distinct sequences (i.e. the homologous chromosomes) need to be represented as a single one.
Our novel quality metrics may be used to generate high-quality assemblies of piRNA clusters and thus allow us to address some of the central open questions in TE biology, such as the evolutionary dynamics of piRNA clusters.

ACK N OWLED G EM ENTS
We thank Kirsten-André Senti for advice and providing the D. melanogaster strain Iso-1, Christos Vlachos for sharing scripts, Elisabeth Salbaba for technical support and all members of the Institute of Population Genetics for feedback and support. This work was supported by the Austrian Science Foundation (FWF) grants P30036-B25 to RK and W1225.

CO N FLI C T O F I NTE R E S T
The authors declare that they have no competing interests.

AUTH O R CO NTR I B UTI O N S
RK, FS and FW conceived this work. FS and OC generated the data.
FW performed PCR. FS and FW analysed the data. RK and FW provided software. RK, FS and FW wrote the manuscript.

DATA AVA I L A B I L I T Y S TAT E M E N T
Scripts for computing our quality metrics, including a manual and a walkthrough, are available at https://sourc eforge.net/proje cts/ cusco quali ty/. We recommend to obtain the scripts via subversion