The Bellerophon pipeline, improving de novo transcriptomes and removing chimeras

Abstract Transcriptome quality control is an important step in RNA‐Seq experiments. However, the quality of de novo assembled transcriptomes is difficult to assess, due to the lack of reference genome to compare the assembly to. We developed a method to assess and improve the quality of de novo assembled transcriptomes by focusing on the removal of chimeric sequences. These chimeric sequences can be the result of faulty assembled contigs, merging two transcripts into one. The developed method is incorporated into a pipeline, which we named Bellerophon, that is broadly applicable and easy to use. Bellerophon first uses the quality assessment tool TransRate to indicate the quality, after which it uses a transcripts per million (TPM) filter to remove lowly expressed contigs and CD‐HIT‐EST to remove highly identical contigs. To validate the quality of this method, we performed three benchmark experiments: (1) a computational creation of chimeras, (2) identification of chimeric contigs in a transcriptome assembly, (3) a simulated RNA‐Seq experiment using a known reference transcriptome. Overall, the Bellerophon pipeline was able to remove between 40% and 91.9% of the chimeras in transcriptome assemblies and removed more chimeric than nonchimeric contigs. Thus, the Bellerophon sequence of filtration steps is a broadly applicable solution to improve transcriptome assemblies.

power and the transcriptome will be of high quality as long as the genome is of high quality. For transcriptomes of organisms without a reference genome, there is the second method: a de novo transcriptome assembly for which no reference data are required. The most commonly used de novo transcriptome assembler is Trinity (Haas et al., 2013). This tool uses de Bruijn graphs to construct contigs from overlapping cDNA reads (Grabherr et al., 2011). However, a de novo assembly requires high computational power and its quality is difficult to assess, because of the lack of reference DNA or RNA data to compare it with (Li et al., 2014). Sequencing errors can greatly alter the assembled transcriptome, which thus induces errors in the differential gene expression analysis (Marchant et al., 2016;Martin & Wang, 2011).
Different types of errors can occur during a de novo transcriptome assembly process (see Smith-Unna et al., 2016, 1). For example, assembled transcripts can be incomplete, one transcript can be assembled into multiple contigs, or multiple transcripts can be fused into one contig. Chimeric sequences can occur naturally in transcriptomes (i.e., not the result of assembly errors), but these sequences are rare (Frenkel-Morgenstern et al., 2012). False chimeric contigs are the product of a misassembly of multiple different transcripts that have erroneously been assembled together into one contig. This can occur when de Bruijn graph extension is difficult due to repeated regions or when two sequences are almost identical (Lima et al., 2017). There are two defined types of false chimeric sequences: (a) the contig can be composed from different isoforms of the reference transcript, which is called a self-chimera, (b) the contig can be composed from two different transcripts, which is called a multi-chimera (Yang & Smith, 2013).
Assembly errors might be identified and filtered out by mapping the cDNA reads to the assembled contigs (Smith-Unna et al., 2016).
Different patterns of read coverage can be evidence for different types of errors. For example, high variation between the number of reads mapped to a contig, or the lack of reads mapping to a contig, can indicate inappropriately assembled transcripts. In general, contigs should be evenly expressed, because different parts of a correctly assembled transcript should not be differentially expressed.
An uneven expression pattern is typical of false chimeric contigs.
The exception is when multiple splicing variants of a gene are present in the transcriptome and assembled by the assembler.
Only a short list of tools are available to assess the quality of a de novo transcriptome. The tools KisSplice (Sacomoto et al., 2012), DRAP (Cabau et al., 2017), RSEM-EVAL (Li et al., 2014), and TransRate (Smith-Unna et al., 2016) all assess the quality of a transcriptome. DRAP and KisSplice are transcriptome assemblers on their own, focusing on transcriptome quality assessment and chimera removal, while RSEM-EVAL and TransRate are postassembly tools. When working with already assembled transcriptomes that need to be optimized, RSEM-EVAL, and TransRate would be a better choice, as de novo assembly remains a computationally intensive task that is not easily redone. RSEM-EVAL requires a reference set of transcripts, which can be from a closely related species, and uses the reference set to estimate transcript length distribution. This thus makes RSEM-EVAL not truly reference-free. TransRate is truly reference-free and only requires the sequencing reads and the assembled transcriptome.
As gene expression levels in RNA-Seq experiments are determined by the relative number of reads that are aligned to a contig, and chimeras in an assembly make the read mapping more difficult, chimeras alter the accuracy of differential gene expression analysis.
For example, if the original sequence of a chimera remains in the assembly, the reads of these transcripts are assigned to the chimera, which likely alters the observed level of expression. In addition, novel transcripts discovery can be complicated by chimeric sequences: (a) chimera can be mistaken for unknown transcripts (b) annotations of new transcripts can be difficult when a contig is composed from multiple transcripts. Removing these chimeras can be a complicated task, because it would require a full transcriptome annotation. This annotation then would have to be screened for genes with double annotations and even then, there is no guarantee that chimeras can be located.
We developed a pipeline that is specifically aimed to filter out chimeras to reduce false-positive gene discovery and false-negative differentially expressed genes. To achieve this goal, we focused on three research aims: (a) to develop a method to assess a de novo transcriptome, (b) to use the quality assessment to improve the transcriptome assembly, with specific focus on the removal of chimeric sequences, (c) to make the method as broadly applicable as possible, and to make it as easily applicable as possible. The method of quality assessment and quality improvement is incorporated in an easy to use pipeline with optional user customizability. The pipeline is named after Bellerophon, the hero in Greek mythology that slayed the Chimera. Using Bellerophon to target and remove assembly errors is a useful addition to the short list of transcriptome quality improvement tools currently available.

| The Bellerophon pipeline
To automate the process of filtering and optimizing the transcriptome assembly, the Bellerophon pipeline was developed. This pipeline requires only the sequencing reads and the transcriptome assembly. The user is able to customize the cut-off scores used in the filtration, the order in which to apply the filters and the number of threads used. Bellerophon automatically generates a report that states the results of the pipeline. The user is able to customize the filtering order, but works by default as follows ( For CD-HIT, the default value was determined at 0.95 in order to make the filter more lenient compared with the tool's default of 0.9.
Although these values are set as default by Bellerophon, they can be easily changed to accommodate the user's dataset.

| Validation
To test and validate the Bellerophon pipeline, we used two datasets: (a) RNA-Seq data obtained by Illumina sequencing the pheromone glands of females from a laboratory strain of Heliothis subflexa (Lepidoptera, Noctuidae), which first needed to be assembled de novo, (b) a simulated RNA-Seq experiment obtained through the tool polyester (Frazee, Jaffe, Langmead, & Leek, 2015) and a reference transcriptome of 3,000 transcripts expressed by Drosophila melanogaster (Diptera: Drosophilidae).
These datasets were used in three different experiments. The first experiment used the de novo assembly of the H. subflexa pheromone gland transcriptome and chimeras computationally created from this transcriptome. Details on the assembly of the H. subflexa pheromone gland transcriptome assembly can be found in the Appendix S1. The second experiment focused on the chimeras contained in the 10 contigs groups of the H. subflexa pheromone gland transcriptome for which the assembler predicted the most isoforms.
The third experiment used a Trinity assembly generated from simulated RNA-Seq experiment of 3,000 D. melanogaster transcripts.
In this experiment, we were able to recognize rightfully assembled contigs from chimera by comparing them with the reference D. melanogaster transcripts. Additionally, BUSCO was used to compare the completeness of the H. subflexa transcriptome before and after filtering with Bellerophon, using the Insecta odb9 ortholog set.

| Validation using computationally created chimeras
To benchmark the performance of Bellerophon, a set of 500 chimeras was created by randomly selecting two sequences from the H. subflexa RNA-Seq dataset. These sequences were combined by randomly choosing a percentage between 30% and 70% overlap and concatenating the sequences in these proportions. The newly generated chimeras were placed with the other contigs in the assembly. This process was repeated five times. Each assembly with created chimeras was subjected to the Bellerophon pipeline. To test if a significant percentage of chimeras was removed by each step, we compared it with the mean percentage of sequences that was removed by the same step using an unpaired t test followed by a Bonferroni correction.

| Validation using real assembled chimeras in isoform-rich contig groups
Trinity uses an algorithm to find possible isoforms, which occasionally produces more isoforms than actually occur in vivo. This makes groups of isoform-rich contig good candidates to search for chimeric sequences. The 10 contig groups with the highest number of isoforms were selected and blasted against the nonredundant protein database (NR), using the BLASTX algorithm (E-value cut-off: 10 -4 ).
Contigs were identified as chimeric when a sequence contained two matches that did not overlap in their mapping region and had hits with different genes. These contigs are shown to have multiple transcripts on one contig and were marked as a chimera. To measure the ability of the different steps to remove chimeric contigs, we counted the number of marked chimeras left in the assemblies after each steps.

RNA-Seq experiment
To further evaluate the performance of the our filtering methods, we used the tool Polyester (Frazee et al., 2015) to generate RNA-Seq reads from a random selection of 3,000 transcripts from the D. melanogaster reference genome (NCBI RefSeq GCF_000001215.4).
The expression profile of D. melanogaster transcripts was defined through the "fpkm_to_counts" and the "create_read_numbers" functions of polyester, using the expression values of the contigs from our assembly of the H. subflexa female pheromone gland as input.
Three sets of 20,370,192, 19,995,866, and 20,045,180 reads were generated using the "simulate_experiment_countmat" function.
Assembled contigs matching less than 5 reads were removed. To determine which assembled contigs were chimeric and which were not, we blasted the assembled transcriptome against the reference D. melanogaster transcriptome (BLASTn, ID percentage cut-off: 90%; E-value cut-off: 10). Contigs matching more than one transcript from the reference transcriptome were considered chimeric. Contigs matching exactly one were not. F I G U R E 1 Bellerophon pipeline default filtration order. Violet paths use TransRate-Q and BUSCO to assess assembly quality. Orange paths display the sequential filtering steps. The output of each filtering step is used as the input of the following filtering step

| Validation using a D. melanogaster RNA-Seq experiment
To further evaluate the performance of our filtering methods, we assembled a de novo transcriptome using RNA-Seq reads, from D. melanogaster virgin female heads, available on the Sequence Read Archive of the NCBI (Bioproject: PRJNA527373; experiments: SRR8735410, SRR8735411 and SRR8735412). The transcriptome was assembled following the same protocol as above. We blasted the contigs assembled by trinity against the reference set of D. melanogaster transcripts with the following filters: E-value ≤ 10 -3 , percentage of identify ≥ 90, length of alignment ≥ 300 nucleotides.
D. melanogaster transcripts were related with their gene of origin.
Transcriptome contigs matching only one gene were considered as nonchimeric. Contigs matching multiple genes were considered as chimeric. Contigs matching no genes were considered as unidentified.

| Bellerophon optimal filtering order
Using the three filter components CD-HIT-EST, TPM, and ORF length, an optimal filtering method was designed. To assess the performance of the filter, TransRate-Q was run after each filtering step. Figure 2 displays the TransRate quality score and the number of transcripts remaining after filtering. The filter with the highest resulting TransRate score was the filter with only CD-HIT-EST applied  1 2 3 89 10 11 12 1 2 3 89 10 11 12 1 2 3 89 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12 Since the TPM-CD-HIT-EST filtering reduced the number of segmented contigs in the assembly more than the reverse filtering, we set the default order to be (1) TPM filtering and (2) CD-HIT-EST (filtering order 12 in Figure 2), as the ORF-length filtering appeared to be removing too much transcripts. This was the used order for the chimera removal benchmarking. However, the user of Bellerophon can use any set of filters in any desired order.

| BUSCO and contig length benchmarking
Running BUSCO on the reference transcriptome before and after the Bellerophon pipeline shows a reduction in the number of present groups from 1511 (91.1% of total) to 1,420 (85.8% of total). In contrast, the number of duplicated groups is reduced from 562 to 182 (22.9% of the total number of groups). The BUSCO benchmark numbers are shown in Table 1.
The length of contigs present in the assemblies or removed from them before and after each filtering step are plotting on the Figure   S1a. The mean sequence length is higher for the contigs removed by the TPM and CDHIT filtering test than for the contigs kept by those filters ( Table 2). The contigs removed by the final TransRate-C filtering step of Bellerophon are shorter than those that were not removed.

| Validation using real assembled chimeras in isoform-rich contig groups
When focusing on contigs that belong to isoform groups with a large number of contigs, we found that 68 out of 74 (92%) chimeras were removed by the Bellerophon pipeline. Of these 68, 66 were removed by TPM filtering and two by CD-HIT-EST. TransRate-C did not remove any chimera from this set. However, running TransRate-C before running RSEM decreased the number of chimeric isoforms in the benchmark set from 74 to 18, thus removing 56. The quality score of the assemblies in this experiment had increased from 0.247 to 0.336.

RNA-Seq experiment
In this simulated RNA-Seq experiment, we used 3,000 transcripts expressed by D. melanogaster as a reference to simulate RNA-Seq reads. We then assembled those reads with Trinity, which resulted in 3,709 contigs. By blasting the Trinity assembly against the reference D. melanogaster expressed transcripts, we could relate 3,578 contigs to the original 3,000 D. melanogaster transcripts from which the reads were generated. Through this blasting, we identified 295 contigs of the 3,578 contigs as chimeric, and the other 3,283 contigs as correctly assembled. Figure 4

| Validation using a D. melanogaster de novo transcriptome
In this experiment to further test the effect of Bellerophon on de novo transcriptome assembly, we assembled a transcriptome of  While the number of contigs matching only one gene in the assembly is reduce by a substantial amount by the Bellerophon filtering, the number of unique genes in the assembly is reduced from 10,527 to  In order to check whether Bellerophon removed the shortest sequence of the assemblies, we plotted the number of sequence present and removed at each filtering steps ( Figure S1b). The mean length of the contigs kept or removed along the filtering steps of the pipeline are displayed on the Table 2. These values show that TPM and TransRate-C filtering remove contigs longer than those they keep. Furthermore, the mean length of chimeric contigs are always higher than then mean length of chimeric contigs not filtered out (Table 2).

| D ISCUSS I ON
When there is no reference genome available, the quality of a de assembly. The most unambiguous way to distinguish chimera from alternative splicing is by comparing the contigs to a reference genome, which is thus problematic when no reference is available. However, filtration of rightfully assembled transcripts arising from alternative splicing should be a minor problem in insects, especially for transcriptomes of one or few tissues, because (a) alternative splicing of genes appears to be less common in invertebrates than in vertebrates, with a maximum reported frequencies below 40% (Gibilisco, Zhou, Mahajan, & Bachtrog, 2016;Kim, Magen, & Ast, 2007;Wang et al., 2008), and (b) the majority of alternative splicing events show tissue specificity (Hallegger, Llorian, & Smith, 2010). However, users which have a particular interest in alternatively spliced isoforms should consider not using the CD-HIT-EST filtering step.
When selecting for the filtering order, the RSEM (TPM) filtering step was observed to be the step of the pipeline removing a higher percentage of chimeras than other sequences. This is probably because lowly expressed transcripts have less read evidence and are thus more prone to assembly errors. Furthermore, chimeric sequences are bound to share read mapping with other contigs of the assembly and as such might appear to be lowly expressed.
As this first step removes many chimeras, the performance of the following steps may be reduced because the leftover chimeras may be more difficult to identify. The low number of chimeras discarded by CD-HIT-EST might be explained by the fact that the benchmarkchimeras were randomly selected. CD-HIT-EST works by clustering transcripts based on their sequence identity. The chance of a transcript made up of two randomly chosen transcripts that are 95% identical to another transcript is very low. In the benchmark experiment focusing on assembled chimeras in isoform-rich contig groups, the final TransRate-C step of the pipeline did not seem to remove any chimera. The transcripts from these groups presumably belong to one gene family, while a large number of isoforms are created by Trinity. Probably, fewer reads aligned to the false isoforms than to the real isoforms, so that the false isoforms had a low overall expres- transcripts (Li et al., 2012), the high number of transcripts in our dataset shows an over-prediction of isoforms and other assembly errors by Trinity. The D. melanogaster transcriptome filtering allowed us to observe that gene representation is only marginally impact by Bellerophon. Eliminating chimeras and other assembly errors has made our dataset cleaner and more optimized for further differential expression analysis of RNA-Seq experiments.

ACK N OWLED G M ENTS
We thank David G. Heckel for his comments on the manuscripts, Rik

CO N FLI C T O F I NTE R E S T
None declared.

AUTH O R CO NTR I B UTI O N S
JK, AF, WM, and ATG designed the research. JK and AF performed the research. All authors wrote the paper.

DATA AVA I L A B I L I T Y S TAT E M E N T
The raw reads used to assemble the Heliothis subflexa phero-