K-mer counting and curated libraries drive efficient annotation of repeats in plant genomes

The annotation of repetitive sequences within plant genomes can help in the interpretation of observed phenotypes. Moreover, repeat masking is required for tasks such as whole-genome alignment, promoter analysis, or pangenome exploration. Although homology-based annotation methods are computationally expensive, k-mer strategies for masking are orders of magnitude faster. Here, we benchmarked a two-step approach, where repeats were first called by k-mer counting and then annotated by comparison to curated libraries. This hybrid protocol was tested on 20 plant genomes from Ensembl, with the k-mer-based Repeat Detector (Red) and two repeat libraries (REdat, last updated in 2013, and nrTEplants, curated for this work). Custom libraries produced by RepeatModeler were also tested. We obtained repeated genome fractions that matched those reported in the literature but with shorter repeated elements than those produced directly by sequence homology. Inspection of the masked regions that overlapped genes revealed no preference for specific protein domains. Most Red-masked sequences could be successfully classified by sequence similarity, with the complete protocol taking less than 2 h on a desktop Linux box. A guide to curating your own repeat libraries and the scripts for masking and annotating plant genomes can be obtained at https://github.com/Ensembl/plant-scripts.

sequences of neighboring genes. Moreover, TEs are increasingly receiving attention in studies tackling plant pangenomes (e.g., Gordon et al., 2017). According to the Wicker classification, plant TEs can be classified either as Class I RNA retrotransposons or Class II DNA transposons (Wicker et al., 2007). Software resources such as RepeatMasker (RM) (Smit et al., 2015), RepBase (Bao et al., 2015), or RepetDB (Amselem et al., 2019), which are typically used to annotate TEs and other repeats in plant genomes, use the Wicker classification rules and repeat libraries (Lerat, 2010). These libraries can be generic, such as RepBase, which is available for subscribers only, or customized for a genome of interest with RepeatModeler (RepMod) (Flynn et al., 2020). These repeat annotation strategies can take up to several days on a computer cluster, depending on the genome size, and often mask disease resistance (R) genes, which are of great interest in plant breeding (Bayer et al., 2018).
In addition to the intrinsic biological value of TEs, the annotation of repeats can be used to estimate assembly quality (Wierzbicki et al., 2020) as an alternative to gene completeness (Van Bel et al., 2019). For other genomic analyses, the bulk of repeated sequences may disrupt common computational genomic analyses and are thus often masked out, without any classification attempt. For instance, wholegenome alignment, promoter analysis, and the construction of graph genomes require the computation of frequency tables of k-mers, which are nucleotide words of size k. If repeated sequences are not masked, the frequency tables are severely biased and can affect the results obtained (Hickey et al., 2020). Although annotation approaches based on sequence similarity are computationally expensive, k-mer masking strategies are orders of magnitude faster (Beier et al., 2020;da Cruz et al., 2020;Girgis, 2015;Kurtz et al., 2008) and, in our experience, are much better for prepare whole-genome alignments of barley (Hordeum vulgare L.) and wheat (Triticum aestivum L.) cultivars via LASTZ (Harris, 2007).
In this study, we benchmarked a two-step approach for annotating repeated sequences in plants. First, repeats were called by k-mer counting with the Repeat Detector (Red). Second, the discovered repeated sequences were annotated by sequence alignment to a newly curated metacollection of repeats called nrTEplants. We compared this approach with the conventional RM pipeline on a set of 20 angiosperms from Ensembl with nrTEplants, REdat (Nussbaumer et al., 2013) and custom RepMod libraries. We then compared their performance and discuss the results. The nrTEplants library is bundled with documentation on how to update it and scripts to mask and annotate plant genomes, enabling interoperability, reuse, and reproducible analyses (Wilkinson et al., 2016).

Core Ideas
• Control Pfam domains minimize unrelated coding sequences in repeat libraries. • Repeat calling by k-mer counting with Red does not preferentially mask NLR genes. • Repeats called by Red can be efficiently classified by sequence similarity with minimap2.

Plant repeat libraries
We searched the literature for plant-specific libraries of repeated sequences and selected those in Table 1. Although some are specific for a species or repeat family, others comprise repeats from mixed species, such as REdat from PlantsDB (Nussbaumer et al., 2013) or RepetDB (Amselem et al., 2019). FASTA files with nucleotide sequences of repeats were downloaded from the indicated URLs or obtained from the authors.

Sequence clustering
Transcripts and TE sequences were clustered with GET_ HOMOLOGUES-EST version 10042020 (Contreras-Moreira

Negative control: Pfam domains of disease resistance genes
For the identification and curation of Pfam domains encoded by disease resistance (R) genes, the following steps were per-formed. First, a set of 153 protein sequences encoded by reference R genes (i.e., cloned and/or with robust evidence) was retrieved from http://www.prgdb.org/prgdb (Osuna-Cruz et al., 2018). Second, the program hmmscan from HMMER Version 3.2.1 (Eddy, 1998) was used for initial Pfam domain identification (Version 32, default settings), yielding a total of 60 Pfam hidden Markov models. The observed order and combinations of Pfam domains were retrieved. Third, the proteins of six plant species (A. thaliana, B. distachyon, G. max, H. annuus, H. vulgare, and T. aestivum) containing at least one of the 60 Pfam domains previously identified were retrieved from https://plants.ensembl.org/biomart/martview (Kinsella et al., 2011). These proteins were subsequently filtered, retaining only those with the ordered combinations of Pfam domains observed in the reference R proteins, and were considered as potential R proteins (428 in A. thaliana, 577 in B. distachyon, 1,008 in G. max, 849 in H. annuus, 838 in H. vulgare, and 3,607 in T. aestivum). From the initial set of Pfam domains, only 43 were consistently identified in our final panel of potential encoded proteins of R genes and used as a negative control. Note that one of them (PF02892, zf-BED) is often found in transposases (Mistry et al., 2021). The list is available at https://github.com/Ensembl/plant_tools/blob/ master/bench/repeat_libs/control_neg_NLR.list.

De novo annotation of nucleotide-binding and leucine-rich repeat immune receptor genes
The NLR-annotator software package (Steuernagel et al., 2020) was used for de novo annotation of nucleotide-binding and leucine-rich repeat immune receptor (NLR) genes, which are the most abundant R genes characterized to date, in whole genome sequences. Briefly, the 20 plant genomes were dissected into fragments 20 kb in length, with 5 kb overlaps, via the ChopSequence.jar routine. The cut sequences were then scanned to find NLR-associated sequence motifs with the NLR-Parser.jar command. Finally, NLR-Annotator.jar was used to integrate the annotated motifs and retrieve the actual NLR loci in BED format. In order to compute intersections with repeats, only NLR loci with an overlap of >50 bp were considered. Moreover, to account for the fact that the tested masking strategies covered different fractions of the genome, odd ratios of NLR masking were computed via Equation 1: where OR is the odds ratio, NLR masked is the masked NLR space, Gen masked is the masked genome space, NLR is the NLR space, and Gen is the genome space.

Masking and annotation of repeats in plant genomes
RepeatMasker Version 4.0.5 and a fork of Repeat Detector (Red) Version 2.0 adapted for Ensembl, available at https://github.com/EnsemblGenomes/Red, were used to call repeats in plant genomes in the libraries REdat Version 9.3 and nrTEplanst Version 0.3. In addition, RepeatMasker Version 4.1.2-p1 was also run to call repeats with custom repeat libraries produced by 20 parallel jobs in RepeatModeler-2.0.2a (Flynn et al., 2020). Note that custom libraries were obtained for only 10 species, as the remaining RepMod jobs were killed after 7 d in a computer farm. RepMod repeat coordinates were converted to BED format and overlapping intervals were merged. Low complexity sequences were called with dustmasker Version 1.0.0 (Morgulis et al., 2006). Tandem repeats were discovered with trf Version 4.0 with the parameters 2 5 7 80 10 40 500 -d -h (Benson, 1999). Red was called from the script https://github.com/Ensembl/plantscripts/blob/master/repeats/Red2Ensembl.py, which can run several sequences in parallel and feed the results into a Ensembl core database (Stabenau et al., 2004). In addition, minimap2 version 2.17-r974-dirty (Li, 2018) was used to annotate the repeats called by Red with sequences from nrTEplants as follows: minimap2 -K100M -score-N 0 -x map-ont nrTEplants. Minimap2 is called from the script https://github.com/Ensembl/plantscripts/blob/master/repeats/AnnotRedRepeats.py, which parses its output to annotate the repeats. By default, only repeats with a length of >90 bp are processed. Transposable element classification terms are parsed from the FASTA header of the library after a hash (#; e.g., RLG_43695:mipsREdat_9.3p_ALL#LTR/Gypsy). Elapsed runtime and RAM consumption was measured with the command time -v tool.
Genomic intersections among repeated sequences called by Red and RM, and genomic features (i.e., protein-coding genes, exons, proximal downstream and upstream 500-bp windows, and NLR loci) were computed with Bedtools (Version 2.26.0) (Quinlan & Hall, 2010) using bedtools intersect -a bed/genes.bed -b repeat.bed -sorted -wo. To avoid redundancy, exons were extracted from Ensembl canonical transcripts (see http://plants.ensembl.org/info/website/glossary. html). When we retrieved downstream and upstream genomic intervals, intersecting neighbor genes were first subtracted to eliminate any potential coding sequences.

K-mer analysis of repeats in downstream and upstream windows
Repeats overlapping proximal downstream or upstream 500bp windows were extracted via bedtools intersect analysis and the sequences were cut with bedtools getfasta. Canonical kmers with k = [16,21,31] were counted with Jellyfish Version 2.3.0 (Marçais & Kingsford, 2011) by the commands jellyfishlinux count -C -m K -s 2G -t 4 and jellyfish-linux dump -L 20.

Enrichment of Pfam domains
Enrichment was computed by the R function fisher.test (R Core Team, 2020) and Pfam domains (Mistry et al., 2021) were retrieved by Recipe B4 of https://github.com/Ensembl/ plant-scripts (Contreras-Moreira et al., 2021). Pfam domain counts for the complete proteome were used as the expected frequencies.
Only genes with an overlap of >50 bp and domains with adjusted false discovery rates (p < .05) were considered.

Construction and benchmarking of a nonredundant library of repeats: nrTEplants
A set of plant TE libraries and annotated repeats from selected species, listed in Table 1 plus transcript sets from the best functionally annotated plant species in Ensembl were curated and their TE classification terms uniformized. Next, they were merged and clustered (95% identity, 75% coverage of shortest sequence). From the resulting 994,349 clusters, the 174,426 clusters contained TE sequences and were six-frame translated and assigned Pfam domains. Of these, a subset of 8,910 mixed clusters comprising both TE and transcript sequences, and required further processing (see the example in Supplemental Figure S1). After empirical assessment, we decided to take only clusters (a) containing sequences from at least six different TE libraries (six replicates), which eventually left out Rosa TE repeats; and (b) those with a fraction of sequences marked as a 'potential host gene' in RepetDB below 0.00. The resulting nrTElibrary contained 171,104 sequences (see Supplemental Table S1 and Supplemental Table S2). Note that different cut-off values might have been selected with different input sequences or control sets. For example, increasing the number of replicates equates to computing an intersection set. Instead, to get a union set, the cut-off will need to be lowered.
In order to benchmark the newly constructed library, we compiled a positive control comprising 22 Pfam domains found in TEs, and a negative control: a list of 43 Pfam domains found in disease resistance NLR genes. Among these controls, we observed 20 true positives, 2 false negatives, 36 true negatives, and 2 false positives, yielding a sensitivity of 0.91 and a specificity of 0.95. The nrTEplants library can be obtained at https://github.com/Ensembl/plant-scripts/ releases/tag/v0.3. A step-by-step guide on how to produce a nonredundant repeat library, including sample files with the control Pfam domains, is available at https://github.com/ Ensembl/plant_tools/tree/master/bench/repeat_libs.

Masking repeats within plant genomes
Twenty plant genomes were selected from Ensembl (Howe et al., 2020) to benchmark the repeat calling strategies. These are listed in Table 2 next to the genomic fraction of repeats reported in the literature and their guanine-cytosine content. All these genome sequences were annotated with RM (Smit et al., 2015) with several repeat libraries (nrTEplants and REdat) (Nussbaumer et al., 2013) and species-specific custom libraries (RepMod). In addition, the fraction of repeats called by Red, based on k-mer enrichment, is also shown. Note that Red automatically selected k values from 13 to 16 as the genomes increased in length.
In Figure 1, the resulting percentages of repeated sequences are plotted next to the values reported in the literature. The median difference between the REdat repeated fraction and the literature reports is 26.5%. This number is 9.8% for nrTEplants, 4.3% for Red, and 6.3% for RepMod (over 10 genomes). These results suggest that Red can successfully mask any genomes without previous knowledge of the repetitive sequence repertoire of a species. As shown in Supplemental Table S3, Red-masked fractions were also consistent among cultivars of the wheat pangenome. Moreover, repeats called by Red generally overlapped sequences masked with REdat (66.6%), nrTEplants (73.8%), and RepMod (94.1%) (see Supplemental Table S4). In contrast, the overlap with low complexity regions (in dustmasker) and tandem repeats (in trf) is small (2.8% and 4.9%, respectively). Table 3 summarizes the number and length of repeats called by all the strategies tested. We observed that Red called more repeats than nrTEplants and REdat but less than custom Rep-Mod libraries (a median of 845 per Mbp, compared with 391 for nrTEplants, 221 for REdat, and 961 for RepMod). In terms of the sequence length of the shortest contig at 50% of the total sequence length, the performance depended on the species, but it seems that repeats called by RepMod are generally shorter. Figure 2 summarizes how the called repeats overlapped with genes, exons, and 500-bp windows upstream and downstream. It can be seen that Red repeats overlapped a larger fraction of the gene space (23.2%) than REdat (12.4%) and nrTEplants (18.8%), as did RepMod repeats (24.4%). When only exons were considered, REdat repeats overlapped 4.1% of these, with nrTEplants, Red, and RepMod behaving similarly (11.6, 11.9, and 11.7%, respectively). The figure also shows that Red and RepMod mask more of the proximal upstream and downstream space, which will probably have a positive impact on k-mer counting strategies for promoter analysis (Ksouri et al., 2021). The analysis in Supplemental Table S5 shows that Red identified four times more kmers with 20+ copies in this regulatory space, which agrees with recent work showing that unidentified TEs are overrepresented in specific regulatory networks (Baud et al., 2019).
In order to check whether the compared approaches masked preferentially genes from certain families, a Pfam enrichment analysis was carried out; this is summarized in Figure 3. It can be seen that RepMod and Red repeats show the least T A B L E 2 Plant genomes from release 49 (September 2020) of Ensembl Plants (Howe et al., 2020) Table S6). As gene annotation is frequently performed after repeat masking, we reasoned this could affect the Pfam enrichment analyses. Therefore, we carried out a complementary analysis where NLR genes were called de novo on the genomic sequences instead of using the Ensembl gene annotation. The results, summarized in Supplemental Table S7, confirm that Red tends to mask fewer NLR genes than expected at genomic scale, with only one species (Trifolium pratense L.) with an odds ratio >1. In contrast, we obtained odd ratios greater than 1 for several species with REdat (n = 7), nrTEplants (n = 12), and RepMod (n = 6 out of 10 species).

Annotating Red-masked repeats within genomes with nrTEplants and minimap2
In the previous analyses, we showed that Red masking is an effective way of calling repeats in plant genomes, comparable with RepMod. Moreover, we observed that nrTEplants behaved better than REdat in most cases. Therefore, we wanted to check whether repeats called with Red could be annotated and classified. For that, we aligned the repeat sequences against the nonredundant nrTElibrary with min-imap2. The results are plotted in Figure 4, where it can be seen that in most species, more than half of the repeat space could be annotated (median: 65.9%). As our library contained F I G U R E 1 Fraction of repeated sequences in plant genomes. Twenty genomes from release 49 (November 2020) of Ensembl Plants were annotated with RepeatMasker (Smit et al., 2015) and the libraries REdat (Nussbaumer et al., 2013) and nrTEplants. The results for 10 genomes masked with RepMod custom libraries are also shown (Flynn et al., 2020). The percentage of repeated sequences is plotted next to the values reported in the literature for those genomes and the fraction of repeats provided by Repeat Detector (Red), based on k-mer enrichment (Girgis, 2015). Species are sorted by genome size from smallest to largest T A B L E 3 Summary of repeated sequences annotated with Repeat Detector (Red) (Girgis, 2015) and RepeatMasker (Smit et al., 2015) with the libraries nrTEplants and REdat (Nussbaumer et al., 2013) and with custom libraries obtained for some species by RepeatModeler (Flynn et al., 2020). Total repeats and N50 is the sequence length of the shortest contig at 50% of the total sequence length (N50) estimates of repeats are shown F I G U R E 2 Fraction of exons, genes, and 500-bp upstream and downstream regions overlapping annotated repeats in plant genomes. Twenty genomes from release 49 (November 2020) of Ensembl Plants were annotated by Red (Girgis, 2015) or RepeatMasker (Smit et al., 2015) with the libraries REdat (Nussbaumer et al., 2013) and nrTEplants. The results for 10 genomes masked with RepMod custom libraries are also shown (Flynn et al., 2020) only TEs, we expected a fraction of the unmapped space to contain simple repeats or satellite DNA. However, in some species, only a small fraction of repeats could be classified. We reasoned this was caused by a repeat consensus not represented in the library. This was confirmed in a separate experiment, where the repeated sequences of olive and R. chinensis obtained from their authors were mapped to Red repeats, as seen in Figure 4 (control). Another positive control was also carried out with sunflower repeated sequences in order to confirm that no valuable repeats had been lost during the construction of nrTEplants. These results indicated that in species where a curated library did not work well, the repeats could be classified by custom collection of repeated sequences for that taxon. As we saw in the previous section, this can also be achieved with species-specific libraries produced with Rep-Mod; however, note that in our tests three-fourths of repeat families discovered by RepMod remained unclassified (see Supplemental Table S8). The results in the previous paragraph were obtained with the default map-ont setting of minimap2. Note that we also tried the map-pab and asm20 settings, but obtained similar results. Red clover (Trifolium pratense) was reanalyzed replacing minimap2 with the BLAST algorithms megablast, dc-megablast, blastn, and rmblastn (Altschul et al., 1997). Compared with the mapped fraction produced by minimap2 (0.4%), a maximum value of 6.1% was obtained with blastn. This modest gain in sensitivity required 1,412 min. The algorithm rmblastn, used by RM, yielded a mapped fraction of 0.7%. We concluded that the alternatives to minimap2 offered little gain at the cost of spiralling computing time.

F I G U R E 3
Enriched Pfam domains of protein-coding genes overlapping repeats. Twenty genomes from release 49 (November 2020) of Ensembl Plants were annotated with Repeat Detector (Red) (Girgis, 2015) and RepeatMasker (Smit et al., 2015) with the libraries REdat (Nussbaumer et al., 2013) and nrTEplants. The results for 10 genomes masked with RepMod custom libraries are also shown (Flynn et al., 2020) F I G U R E 4 Fraction of Repeat Detector (Red) repeats mapped to nrTEplants sequences. Twenty genomes from release 49 (November 2020) of Ensembl Plants were annotated with Red (Girgis, 2015). The resulting repeats were subsequently mapped to the library nrTEplants with minimap2 (Li, 2018), producing the genome fractions shown. Repeats from three species (R. chinensis, O. europaea, and H. annuus) were also mapped to annotated repeats provided by the respective sequencing consortia as a control. Species are sorted by genome size from smallest to largest Figure 5 shows the runtime and RAM required by the twostep protocol presented in this paper, measured on a Cen-tOS7.9 computer using four cores of a Xeon E5-2620 v4 (2.10 GHz) central processing unit. Panels A and B correspond to the first step, Red masking. It can be seen that all genomes tested take less than 40 min to run, with the excep-tion of tetraploid Triticum turgidum L., which took 71 min. The memory consumption was below 20 GB in most cases, but climbed to 22.7 GB and 29.9 GB for Aegilops tauschii Coss. and T. turgidum. Panel C illustrates the runtime of the second step, the mapping of nrTEplants. It can be seen that all plants required less than 27 min, except A. tauschii and F I G U R E 5 Runtime and memory requirements of a two-step repeat annotation protocol based on the Repeat Detector (Girgis, 2015), minimap2 (Li, 2018), and the nrTEplants library. The protocol was tested on 20 genomes from release 49 (November 2020) of Ensembl Plants. Similar values were measured on an Ubuntu box with four-core i5-6600 (3.30 GHz) central processing unit cores T. turgidum, which took 3 and 1 h respectively. The memory consumed by minimap2 was ∼3.8 GB in all cases. A comparison with the data in Supplemental Table S8 indicated that the protocol presented in this paper was up to two orders of magnitude faster than the combination of RepMod and RM, even with only four central processing unit cores.

CONCLUSIONS
The hybrid two-step methodology presented in this paper was tested on 20 angiosperms with genome sizes ranging from 0.12 to 10.46 Gbp. Overall, we observed that Red consistently produced repeated fractions similar to the expected val-ues from the literature. Comparable results were obtained for 10 species analyzed with RepMod custom libraries. The metalibrary nrTEplants, built by Pfam-informed sequence clustering, also showed good performance in most species but failed to recover the expected repeat fraction in cases such as melon (Cucumis melo L.) or sunflower. This observation highlights the problem of using repeat libraries that do not include sequences similar to the genome of interest. This is the most likely explanation for the low masking values observed for REdat, as that library was produced before many of these genomes were available. For that reason, separating the tasks of calling and classifying repeats, as performed here, seems a promising strategy. On the one hand, Red k-mer masking does not have a preference for masking particular protein-coding families, in contrast to repeats annotated with RM using REdat and nrTEplants. In fact, it also behaved better than custom Rep-Mod libraries with respect to NLR genes annotated de novo . On the other hand, Red appropriately masked plant genomes for which no repeat libraries have been curated yet. If there is a need to classify the repeats called by Red, a curated repeat library can be obtained directly from Ensembl Plants (see https://github.com/Ensembl/plant-scripts/blob/master/ repeats/get_repeats_ensembl.sh) or the INSDC archives (see, for example, https://www.ebi.ac.uk/ena/browser/view/ CACTIH01), or by clustering repeats from different sources, as demonstrated in this study. Our protocol took less than 2 h to run and up to 30 GB of RAM, and can use nrTEplants or any repeat library in FASTA format. This is about two orders of magnitude faster than building species-specific custom libraries with RepMod for the species tested in this benchmark. We thus conclude that the approach presented here is an efficient way of annotating repeated sequences in plant genomes.

D A T A A N D S O U R C E C O D E AVA I L A B I L I T Y
The repeat library and the scripts used to mask and annotate the plant genomes, together with the benchmark scripts and data, can be obtained at https://github.com/Ensembl/ plant-scripts.

AU T H O R C O N T R I B U T I O N S
Bruno Contreras-Moreira: formal analysis, funding acquisition, investigation, resources, software, supervision, writing-original draft, writing-review and editing. Carla V Filippi: data curation, investigation, methodology, resources, writing-original draft, writing-review and editing. Guy Naamati: resources, software, writing-review and editing. James E Allen: resources, software, writingreview and editing. Paul Flicek: funding acquisition, resources, writing-original draft, writing-review and editing.

A C K N O W L E D G M E N T S
We are grateful to Doreen Ware and Vasili Sitnik for comments on drafts of this manuscript. We thank the Gramene team for their continual support and cooperation, as well as members of the Ensembl team for developing and maintaining the front-end and back-end software and infrastructure that underpin Ensembl Plants.

C O N F L I C T O F I N T E R E S T Paul
Flicek is a member of the Scientific Advisory Boards of Fabric Genomics, Inc. and Eagle Genomics Ltd.