EXFI: Exon and splice graph prediction without a reference genome

Abstract For population genetic studies in nonmodel organisms, it is important to use every single source of genomic information. This paper presents EXFI, a Python pipeline that predicts the splice graph and exon sequences using an assembled transcriptome and raw whole‐genome sequencing reads. The main algorithm uses Bloom filters to remove reads that are not part of the transcriptome, to predict the intron–exon boundaries, to then proceed to call exons from the assembly, and to generate the underlying splice graph. The results are returned in GFA1 format, which encodes both the predicted exon sequences and how they are connected to form transcripts. EXFI is written in Python, tested on Linux platforms, and the source code is available under the MIT License at https://github.com/jlanga/exfi.


| INTRODUC TI ON
In the last decade, high-throughput sequencing technologies have enabled biologists to unravel the genetic code on a massive scale and at an unprecedented rate. However, sequencing and assembling whole genomes of nonmodel species is still not practical. Therefore, alternative approaches are needed to capture genetic variation. One approach commonly used in the context of population genetics is restriction site-associated DNA sequencing (RAD-Seq; Baird et al., 2008), which returns polymorphic markers at random loci across the entire genome.
Posterior enhancements, such as RAD-Seq followed by sequence capture (Rapture; Ali et al., 2016), have been recently proposed as an efficient and cost-effective approach for genotyping thousands of samples and loci simultaneously (Meek & Larson, 2019).
Another successfully proven and cost-effective approach is to discover SNPs by sequencing both DNA and RNA and subsequently genotype large numbers of individuals (Kumar et al., 2019;Lamichhaney et al., 2012;Montes et al., 2013Montes et al., , 2015Therkildsen & Palumbi, 2017). For these methods, attention is explicitly restricted to transcriptomic SNPs: Those contained inside expressed genes due to their higher functional relevance, rather than intergenic and intronic regions. The combined approach of DNA and RNA sequences to SNP discovery has obtained the highest nonmodel SNP validation rates to date, without requiring a reference genome, and its success is largely due to the accurate detection of intron-exon boundaries (IEBs), which can confound genotyping primer design (Wang et al., 2008; see Figure 1). The IEB detection method developed by Conklin, Montes, Albaina, and Estonba (2013), for example, relies on | 8881 LANGA et AL.
computing statistically significant positions in the transcript where many genomic reads start or end, indicating possible IEBs.
Traditional approaches to gene annotation in general, and IEB detection in particular, are based on the annotation of a genome assembly. For example, the NCBI Prokaryotic Genome Annotation Process (https://www.ncbi.nlm.nih.gov/genom e/annot ation_prok/ proce ss/, last accessed 2020-03-04) relies on the prediction of transcribable regions based on alignments to known transcripts and proteins, and ab initio predictors of coding and noncoding genes. A more popular solution is to align either transcriptome or RNA-Seq reads with a splice-aware aligner such as GMAP (Wu & Watanabe, 2005), and extract from the results the IEB coordinates.
An alternative approach to finding IEBs can be based on creating a splice graph, a mathematical representation of a transcriptome where exons are represented by nodes, IEBs as edges, transcripts as paths, and genes as the different connected components. This approach is the one first presented by ChopStitch (Khan et al., 2018) where Bloom filters are used to store frequent k-mers of a shotgun whole-genome sequencing (WGS) dataset and use it to find signals of splicing in every sequence of a transcriptome assembly. This paper presents EXFI, a memory-efficient tool for predicting and annotating the exons of a de novo transcriptome assembly through a splice graph representation. This tool works by comparing transcriptomic k-mers with those sequenced in a WGS experiment, marking potential IEBs wherever a section of a transcript is not found in it. To assess its performance, we compare it with ChopStitch and GMAP, using two synthetic datasets where references are available (human being and zebrafish); two fish species (Atlantic herring and Atlantic salmon) for which there exist reference annotations and experimental WGS datasets; and two species for which there only a draft genome and transcriptome are available (sugar pine and axolotl). Finally, we applied EXFI to a recently published dataset on tench (Kumar et al., 2019) to evaluate its success in IEB detection for SNP discovery in exonic regions.
We expect this method to be useful not only in the context of the original aim, the decomposition of transcripts into exons for gene-targeted SNP genotyping in organisms where genomic references are not available or not reliable, but also in the design of array-based tools such as sequence and exome capture, exome-wide genotyping, and RNA expression microarrays. Finally, recent developments in selective nanopore sequencing (Payne et al., 2020) are very likely to increase the relevance of exome-targeted approaches such as the one described here.  A Bloom filter (BF;Bloom, 1970) is a fast and succinct data structure for set membership (i.e., to test whether a k-mer is present in a transcript). Bloom filters have been successfully used in many highthroughput sequencing problems, including k-mer counting (Melsted & Pritchard, 2011), read compression (Benoit et al., 2015), read normalization (Crusoe et al., 2015), read filtering (Chu et al., 2014), error correction (Benoit, Lavenier, Lemaitre, & Rizk, 2014;Salmela & Rivals, 2014;Salmela, Walve, Rivals, & Ukkonen, 2017;Song, Florea, & Langmead, 2014), genome assembly (Chikhi, Limasset, & Medvedev, 2016;Chikhi & Rizk, 2012;Jackman et al., 2017;Peterlongo & Chikhi, 2012), gap filling (Paulino et al., 2015;Rizk, Gouin, Chikhi, & Lemaitre, 2014;Vandervalk et al., 2015), and targeted gene assembly (Kucuk et al., 2017). The advantage of this data structure is that it is very fast and space-efficient, with the drawback of being probabilistic: It does not return false negatives, but it can produce false positives with a tunable false-positive rate (BF FPR).

| Baited bloom filter construction
This rate, for a given dataset, depends on three parameters that are under our control: the k-mer length, the amount of memory, and the number of hash functions used.
In the human and zebrafish genomes, only 4.24% and 5.68% of the bases are exons, respectively (Table 1). Therefore, this Bloom F I G U R E 1 Two cases in primer design that can lead to genotyping failure: primers in different exons that require excessive PCR extension across an intron (top); a primer spanning an IEB will fail to anneal (bottom) filter approach can be used to remove WGS reads that are not exonic, and then reduce the BF FPR by nearly an order of magnitude. Additionally, cascading Bloom filters (Salikhov, Sacomoto, & Kucherov, 2014), a modification of original the data structure, stacks together multiple Bloom filters to keep frequent-enough k-mers and discard the ones produced by sequencing errors. Together, both approaches serve to filter out irrelevant but significant fractions of the original WGS experiment.
In EXFI, build_baited_bloom_filter uses both the transcriptome assembly and the WGS reads and performs the task in three steps.
First, a Bloom filter of the transcriptome is built with biobloommaker. Second, each read of the WGS dataset that does not share at least one k-mer with the transcriptome is discarded with biobloomcategorizer. And third, the remaining reads are used to build a cascading Bloom filter with ABySS. The result is a binary file encoding the error-free k-mers of the reads that overlap the transcriptome.

| Exon and splice graph prediction
The exon and splice graph prediction procedure is carried out by the build_splice_graph script, which predicts in one step the exon sequences, the exon composition of each transcript, and the splice graph structure of the entire transcriptome. The false positives that the Bloom filter produces may cause additional nucleotides in the raw exome and disconnected exons of length k. To prevent downstream problems, exons of length less than k + q (q by default five) are filtered out ( Figure 2c). Once deleted, a more relaxed merging step is applied when exons overlap by an excessive number of bases (10 by default; Figure 2d). Finally, if the -polish flag is specified, each pair of exons with a long-enough overlap is inspected for the donor/acceptor sites (usually GU/AG; 2e) and correctly trimmed if possible.
The primary output is a GFA1 file that encodes the inferred exons in terms of sequence and coordinates, the connections between them, and the transcripts as paths of exons. This type of file can be visualized with Bandage (Wick, Schultz, Zobel, & Holt, 2015), which also is helpful to manipulate exons and transcripts of interest, as well as to perform BLAST queries. Additionally, (gfa1_to_fasta) extracts the exons in FASTA format. It can also return the spliced transcripts, where each one of them is represented by the corresponding exons separated by a predefined amount of Ns.

| Validation datasets
Four reference datasets were selected: zebrafish (Danio rerio) and human being (Homo sapiens) as the key species, due to the depth of their available annotations; and Atlantic herring (Clupea harengus) and Atlantic salmon (Salmo salar), both with complete assemblies and exon annotations. Also, Salmoniformes are known to have an additional genome duplication round not shared by the other fish species here studied (Allendorf & Thorgaard, 1984), expanding both the genome length and number of genes (and therefore transcriptome complexity; Table 1). Additionally, to serve as a bridge between reference and de novo transcriptome assemblies, an RNA-Seq muscle library from Atlantic herring was assembled. Finally, two species without annotations, sugar pine (Pinus lambertiana) and axolotl (Ambystoma mexicanum), were added to test the upper limits of the methods studied in terms of genome length and sequencing effort. These two species are known for their large genome sizes (27 and 32 GB, respectively) due   Table S1). These assemblies varied in terms of both sequencing depth and individuals, from a 22× of a single individual in axolotl to 51× of a pool of 50

TA B L E 1 Experimental statistics of the studied cases
Atlantic herring samples.

| Benchmarking metrics
The performance of EXFI was compared with two tools: GMAP (Wu & Watanabe, 2005) and ChopStitch (Khan et al., 2018). GMAP is a method used to perform gapped alignments of expressed sequence tags (ESTs) and assembled transcripts to a reference genome. Its main advantage is that is easy to use and has been used extensively to annotate eukaryotic genomes with PASA (Haas et al., 2003).
ChopStitch is a tool similar to EXFI that uses Bloom filters to predict exons and the splice graph, using the entire WGS dataset, and different exon prediction algorithms. Table 2 shows the main differences, advantages, and disadvantages between the three methods.
To compare the three methods in terms of speed and accuracy, two metrics were studied: one based on the recovery of the available annotation, and another in terms of mappability of the predicted exons to the genome.
For studying the recovery of the available annotation, reference exon coordinates in GFF format were transformed to BED, converting  In the case of mappability measurements, predicted exons were aligned against their genomic reference with BWA MEM (Li, 2013), results were stored in BAM format with SAMTools (Li et al., 2009), and the reported statistics were obtained from the number of mapped exons, and the ones mapped with a perfect CIGAR string (all matches or with small insertions and deletions, but no base clipping).

| Objectives of the benchmarks
Before comparing the three methods, it was necessary to measure the influence of four parameters that may impact performance, in terms of both time and memory consumed, and the trade-off between precision and speed. From the two metrics described above, we used the annotation-based statistics as the ones that drove the experimental design since they showed more differences in terms of percentage points, and because mapping methods require a minimum seed length, which impacts the alignment of microexons. Third, the trade-off in terms of precision and recall with varying BF k-mer lengths was analyzed. If kis set too low, k-mers become less specific and more reads are inserted into the filter, increasing the BF FPR and lowering the precision, while increasing runtime too since there are more k-mers and reads processed. On the contrary, if k is set too high there will be fewer elements to insert, and since a significant fraction of them will contain variants and sequencing errors, they will be filtered by their low frequency (lowering the BF FPR but also the recall). To find the appropriate k-mer length, EXFI was run with the lowest and highest memory settings (4 and 60 GB) and by varying the k-mer length from 21 to 65 using odd values.
Finally, an acceptable genome coverage is needed for a successful experiment. On the one hand, a WGS experiment with little coverage will make the method underperform. On the other hand, too much coverage will make the BF FPR larger than necessary because of sequencing errors. As depth increases, the total number of true k-mers reaches a plateau, while the number of k-mers that contain sequencing errors keeps growing linearly (see figure 3 in Melsted & Pritchard, 2011). Therefore, a central point must exist in between to achieve near-optimal exon precision and recall values. The zebrafish datasets were sampled in 10% increments with Seqtk , applying the procedure to each subsample, and measured the classification metrics, using both low and high memory settings and k fixed to 25 bp.
With respect to the other tools, GMAP version 2018.07.04 was executed using default parameters, and ChopStitch version 1.0.0, using the default k-mer length (50 bp) when possible, and varying the BF FPR values (and therefore different memory usages), over the six datasets (zebrafish, humans, and Atlantic herring), and we measured the performance in terms of the metrics described above: comparison against the annotations and mapping against the genome. All programs were run on a 2× Intel Xeon E5-2620 server, running in total 24 2 GHz threads, with 64 GB of RAM.

| Effects of read filtering
Filtering the reads resulted in a 68%-75% reduction of the BF FPR while also slightly improving all the classification metrics ( Figure 3 and Table S2). We can observe a benefit of the filtering in the low memory case, where the FPR fell from 32.6% to 8.1% and rose the F 1 score from 89.8% to 94.8 when the maximum is of 95.6%.
Additionally, we observe a slight reduction in time: from 172-186 to 149-173 min (Table S8). Therefore, filtering improves both the processing time and the prediction metrics. Similar conclusions can be reached in the human dataset (Table S3 and Figure S1).

| Effects of memory usage
The most significant parameter that impacts the Bloom filter is its size. Figure 3 and  Figure S1). Therefore, a 4 GB Bloom filter is enough to achieve near-optimal results. Also, it is not necessary to demand particularly low (5% or less) BF FPR to predict exons accurately.

| Effects of k-mer length
As described in Methods sections, precision and recall are related through the k-mer length. On the one hand, as k increases, the precision also increases until k = 47, when it starts to decrease rapidly ( Figure 4 and Table S4). On the other hand, recall decreased almost from the start (4 GB: k = 25; 60 GB: k = 23). According to the F 1 statistic (the harmonic mean between precision and recall), for both methods, there is a window of k-mers, from 23 to 35, where this metric remains stable, boosting the recall when k is small, and the precision when k is high. Given the results, we used for the remainder of the analysis a k-mer length of 25 bp to keep the recall as high as possible while keeping precision high too, and set it as the default value in EXFI.

| Effects of sequencing depth
The sequencing depth increased the power of the precision up to a certain point ( Figure 5 and

| Comparison with ChopStitch and GMAP on simulated and real datasets
The performance of ChopStitch, EXFI, and GMAP was compared across six species in terms of the BF FPR and sizes, classification, and mappability scores. Given the results above, we chose to run EXFI using 4 GB of RAM, and a k-mer length of 25. For ChopStitch, we used the default k-mer length of 50 bp, and default BF FPRs of 1% when possible. For GMAP, the default parameters were the ones used. In the case of the megagenomes, gmapl was used as the alignment tool.
There are several differences between EXFI and ChopStitch.
Algorithmically, in EXFI the total amount of memory to be used is specified at the beginning, the number of hash functions is fixed (to four, fixed number in the version of ABySS used), the reads are filtered and processed, and the BF FPR is returned at the end. In contrast, the reverse procedure is applied in ChopStitch: The desired BF FPRs are first specified, and the optimal sizes and number of hash functions are estimated from the full dataset of reads. This procedure selects the optimal memory (maybe unavailable) and number of hash functions to work, but requires to process twice the full WGS reads: one for estimation and other for actual computations. On the other side, EXFI hashes all the WGS reads in two steps: once for the filtering purpose, and a second time for the remainder.
In zebrafish, we considered running EXFI and ChopStitch with multiple memory/BF FPR configurations (respectively 4-60 GB in 4 GB increments, and FPR 1 varying from 20% to 1% and BF FPR 2 set to 1%). In general, EXFI outperformed both methods ( Figure 6 and Table S6) and its performance remained high and constant from 4 GB.
In both datasets, GMAP obtained the fastest data structure construction (24 and 53 min to index the zebrafish and human genomes; Finally, for the axolotl and sugar pine megagenomes, we did not obtain results for all of the methods. Due to the terabase pairs sequenced and the size of the references, ChopStitch was only able to produce a Bloom filter for the axolotl, with a BF FPR 2 of 20.2% and a k-mer length of 25 bp, and GMAP was able to build both references but failed to produce predictions in the axolotl case due to memory exhaustion. For EXFI, even though it can produce Bloom filters with 4 GB of RAM, the BF FPR 2 s were too high to work (52% and 29.4%, respectively; data not shown), and therefore, we raised the RAM to 60 GB to obtain reasonable FPRs. Indeed, we obtained data structures with BF FPRs of 3.1% and 2.0% in the Pine and axolotl cases and after 2 and 1 days of execution, respectively. In the sugar pine

| Retrospective analysis in T. tinca
From the set of 266,578 input transcripts, EXFI predicted 1,072,772 exons. In total, after quality and distance filtering, 228,931 SNPs and 26,169 indels were predicted suitable for genotyping. All IEBs F I G U R E 5 Precision and recall values of EXFI depending on the sequencing depth, using the minimum and maximum memory settings and the k-mer length fixed to 25. Both settings produced similar results, obtaining higher metrics when all the memory was used. Around 25-30× almost all error-free k-mers are sampled, and then, sequencing errors start to pollute the Bloom filter. Both mapping rates stayed above 98.7% proximal to the 96 SNPs described in Kumar et al. (2019) using the Conklin et al. (2013) method were detected by EXFI (therefore 100% precision over this set of SNPs). One SNP was proximal to a falsepositive EXFI IEB, due to multiple variants in a short space, indicating that it would not have been selected for genotyping primer design.
Therefore, on this retrospective SNP discovery task, EXFI would recall 95 of 96 of the selected SNPs.

| D ISCUSS I ON
We developed EXFI, a method that reliably predicts the exon sequences and splice graph of a species using a de novo-assembled transcriptome and raw WGS reads. We tested it in multiple eukaryotic species, varying the genome and transcriptome reference status, simulated and experimental datasets, and samples with different level of heterozygosity of the samples. We found out that EXFI performs better in terms of memory and classification than other tools when describing the structural annotation of every transcript.
We studied the four principal parameters that can affect the prediction procedure: read filtering, memory, k-mer length, and genome coverage. First, by filtering the transcriptome, we ended up reducing by two-thirds the BF FPR while also slightly decreasing the execution time. Therefore, this reduction can be translated into a memory optimization. Second, using more than 4 GB of RAM (and higher BF FPR) yielded equally accurate predictions as using 60 GB (Figures 3 and 6). Thus, commodity desktop and laptop computers are enough to achieve accurate exon predictions on gigabase-sized genomes. Third, our approximation predicted a window of optimal k-mer length values between 23 and 35 base pairs. Finally, we show that 20× coverage is good enough for exon prediction, with optimal coverage between 30× and 40×.
We compared EXFI against ChopStitch, a similar method, and GMAP, a splice-aware program designed to align transcripts to a reference genome. We used datasets that vary in genome size, sequencing depth, number of individuals, and type of input transcriptome.
When taking into account the general picture across the different reference species (zebrafish, humans, salmon, and herring), even with higher BF FPR rates, EXFI obtained better prediction metrics, except for the Atlantic herring (Table 3). In that case, EXFI was less accurate predicting exons, but in terms of mappability, it achieved the highest results across all datasets (99.5%). This high mappability result, although more optimistic than the exon precision, means that even if the exon prediction is not precise, it is still usable for downstream analysis. These perfectly matched predictions are interior sections of the exons rather than the full sequence, which makes them suitable for genotyping, array design, and sequence capture.
We also studied three situations where the input transcriptome was de novo-assembled from RNA-Seq reads: Atlantic herring, to compare the differences between reference and assembled transcriptomes, and the megagenomes of axolotl and sugar pine. In terms of mappability, all three methods performed worse than in reference Note: Best metrics across the three methods are marked in bold. Time is the sum of the walltimes at the building and prediction steps. When possible, the steps were run using all processors available, that is, in ChopStitch's and EXFI's build steps, and in GMAP's predict stage. Memory, expressed in Gigabytes, represents the peak usage in memory. FPR represents the false-positive rate of the Bloom filter used for prediction. Mapped and Perfect stands for the overall alignment rate of the predicted exons, allowing and not allowing clipping, respectively. EXFI was executed to use only 4 GB of RAM except for the megagenomes. ChopStitch with k-mer lengths of 50 bp and FPRs of 1%, except when memory usage was an issue. In the cases marked with asterisks, the k-mer lengths were lowered to 25 bp, and target FPR values were tested one by one in the set of 1%, 5%, 10%, 15%, and 20%. Actual FPRs are the ones reported. In general, when a reference transcriptome was used, EXFI obtained the best precision, while ChopStitch obtained better recall. With respect to alignments to the genomes, EXFI obtained the best mapping rates. et al., 2019). EXFI was also able to find hundreds of thousands of SNPs across almost a million exons. With respect to the set of known genotyped exons, EXFI obtained 100% precision and 99% recall.
These positive results for EXFI are due to the read filtering step and the exon prediction rules used. The filtering step is critical in eukaryotic genomes because a significant fraction of the WGS dataset is not only unnecessary but misleading. The convenience of the exon prediction rules is extracted from Figures 3 and 6: When comparing ChopStitch and EXFI without read filtering, the latter obtained slightly superior precision and recall due to the exon prediction methods, in spite of the relatively high BF FPR (8% vs. 1%).
Moreover, EXFI's predictions are accurate enough when working with relatively high BF FPR 2 .
Previous structural annotation algorithms rely on a whole-genome assembly followed by the mapping of RNA-Seq reads, ESTs and transcripts, and homology predictions against genome, transcriptome, and protein databases. Our results suggest that EXFI is a reliable tool too while avoiding completely the step of generating a high-quality genome assembly.

Recent reviews have been published on RAD-Seq and Targeted
Sequencing approaches ( For optimal results, we propose a two-step experimental approach to study nonmodel exomes: an initial exploration of the exome structure and the variants it contains, followed by targeted sequencing of hundreds to thousands of samples. For the first step, it would be necessary to sequence RNA from as many tissues and development stages, aiming to get the best representation of the transcriptome, and to sequence between 30× and 40× of the genome, preferably from multiple individuals, to discover as many variants as possible. In this regard, Therkildsen and Palumbi (2017) have shown that is possible to move from pools of DNA to individually barcoded individuals. In a second step, a targeted approach would be obtained for thousands of loci and samples, leaving behind most of the genome and therefore being able to fit more individuals and populations in the same sequencing assay. As it happened for Atlantic herring, a DNA sequencing effort initially focused on the transcriptome (Lamichhaney et al., 2012) was reused years later once genome assembly was possible (Barrio et al., 2016).
This report has presented EXFI, a pipeline that predicts the splice graph and exon sequences from a transcriptome and WGS reads instead of a reference genome. Different parameters that affect its performance were studied: read filtering, memory usage, k-mer length, and sequencing depth. Tests were carried out on zebrafish and human simulations, Pool-Seq samples of Atlantic salmon and Atlantic herring, and the megagenomes of the sugar pine and axolotl, varying all in sequencing depth, heterozygosity, genome length, and complexity. A retrospective analysis of a recently published set of transcriptomic SNPs on tench was also done, obtaining 100% precision and 99% recall. It is shown that it is possible to perform structural annotation of a transcriptome of heterogeneous samples with low computational resources. Finally, EXFI is expected to be particularly useful for population genetic studies, phylogenetic relationships, and RNA expression in nonmodel species.

ACK N OWLED G M ENTS
We acknowledge support with the predoctoral grant (PRE_2017_2_0169) and the Genomic Resources Research Group (grant IT558-10), both funded by the Basque Government.

CO N FLI C T O F I NTE R E S T
The authors declare no conflict of interest.

DATA AVA I L A B I L I T Y S TAT E M E N T
EXFI is open source and freely available at https://github.com/ jlang a/exfi. It is subject to Continuous Integration and Unit Testing.
Detailed installation and usage are available in the README file of the repository (https://github.com/jlang a/exfi/README.md).
Additionally, test data are included in the repository. Moreover, a Dockerfile is available at https://github.com/jlang a/exfi-docker to create a container with all the tools installed. Finally, the scripts to validate, benchmark, and reproduce the tables and figures in this document can be found online as a Snakemake pipeline (Köster & Rahmann, 2012) at https://github.com/jlang a/smsk_exfi_paper.