A decade of de novo transcriptome assembly: Are we there yet?

A decade ago, de novo transcriptome assembly emerged as a versatile and powerful approach to draw evolutionary inferences, analyse gene expression, and annotate novel transcripts, in particular for non‐model organisms lacking an appropriate reference genome. Various tools have been developed to generate a transcriptome assembly, and even more computational methods depend on the results of these tools for further downstream analyses. In this issue of Molecular Ecology Resources, Freedman et al. (Mol Ecol Resourc 2020) present a comprehensive analysis of errors in de novo transcriptome assemblies across public data sets and different assembly methods. They focus on two implicit assumptions that are often violated: First, the assembly presents an unbiased view of the transcriptome. Second, the expression estimates derived from the assembly are reasonable, albeit noisy, approximations of the relative frequency of expressed transcripts. They show that appropriate filtering can reduce this bias but can also lead to the loss of a reasonable number of highly expressed transcripts. Thus, to partly alleviate the noise in expression estimates, they propose a new normalization method called length‐rescaled CPM. Remarkably, the authors found considerable distortions at the nucleotide level, which lead to an underestimation of diversity in transcriptome assemblies. The study by Freedman et al. (Mol Ecol Resourc 2020) clearly shows that we have not yet reached “high‐quality” in the field of transcriptome assembly. Above all, it helps researchers to be aware of these problems and to filter and interpret their transcriptome assembly data appropriately and with caution.

KEYWORDS
assembly, bioinformatics, error correction, phyloinformatics, transcriptomics

[Correction added on 02-November-2020, after first online publication: Projekt Deal funding statement has been added.]

[...] struggling to get it right (the last 20%, so to speak), as Steven Salzberg noted in a recent report on pervasive assembly and annotation errors (Salzberg, 2019). The same, if not worse, applies to the analysis of high-throughput transcriptome sequencing data (RNA-Seq), where (de novo) assembly is a prominent first analysis step. While the assembly of transcriptomes has become an everyday bioinformatics task, dealing with all the potential errors and small caveats remains challenging and error-prone, even a decade after the emergence of the first tools (Birol et al., 2009; Grabherr et al., 2011; Schulz et al., 2012).
In their recent study, Freedman et al. (2020) extensively analysed errors, bias, and noise in de novo transcriptome assemblies. In its most common application, RNA-Seq short reads are aligned to a reference genome (map-to-reference, as Freedman et al. (2020) refer to it) to functionally annotate genomic features (such as genes) and estimate their expression levels. In another application, RNA-Seq-derived reads are first assembled de novo to reconstruct the transcriptome, which is then used as a proxy for annotation and expression evaluation (map-to-transcriptome).
According to Freedman et al. (2020), de novo transcriptome assembly is based on two implicit assumptions. First, the assembled sequences represent an unbiased view of the underlying expressed transcriptome, and second, the expression estimates of the assembly are good, if noisy, approximations of the relative frequency of expressed transcripts. These two assumptions have important implications for downstream analysis steps and directly affect gene expression estimates, variant calling, and evolutionary analyses based on a de novo transcriptome assembly. In their work, Freedman et al. (2020) show that these assumptions are frequently violated across different public mouse RNA-Seq data sets and assembly algorithms, thus directly impacting downstream analyses performed on de novo transcriptome assemblies. In particular, they focused on expression estimation bias and differences in nucleotide variant calls while also comparing de novo results against a map-to-reference approach.
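To make the second assumption more concrete: relative transcript frequency is commonly approximated from read counts with measures such as counts per million (CPM) or transcripts per million (TPM). The following minimal sketch shows the standard definitions of both measures. It is not the length-rescaled CPM proposed by Freedman et al. (2020), whose exact formula is not reproduced here, and the contig names, read counts, and lengths are invented purely for illustration.

```python
from typing import Dict


def cpm(counts: Dict[str, float]) -> Dict[str, float]:
    """Counts per million: scale raw read counts by library size only."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("library contains no mapped reads")
    return {name: c / total * 1e6 for name, c in counts.items()}


def tpm(counts: Dict[str, float], lengths: Dict[str, int]) -> Dict[str, float]:
    """Transcripts per million: divide by contig length, then rescale to 1e6."""
    rates = {name: counts[name] / lengths[name] for name in counts}
    total = sum(rates.values())
    if total == 0:
        raise ValueError("library contains no mapped reads")
    return {name: r / total * 1e6 for name, r in rates.items()}


# Hypothetical assembly: contig_a is a short fragment, the others are longer.
counts = {"contig_a": 500, "contig_b": 500, "contig_c": 1000}
lengths = {"contig_a": 500, "contig_b": 2000, "contig_c": 2000}

cpm_values = cpm(counts)
tpm_values = tpm(counts, lengths)

for name in counts:
    print(f"{name}: CPM={cpm_values[name]:.0f} TPM={tpm_values[name]:.0f}")
```

In this toy example, contig_a and contig_b receive the same CPM, whereas TPM ranks the short contig_a highest. This illustrates why fragmented contigs of varying length can distort expression estimates depending on the normalization chosen.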
First, Freedman et al. (2020) dispel the illusion that de novo transcriptome assemblies are mainly composed of full-length transcripts, which is typically not the case for short reads. The authors go on to show that the functional composition of a transcriptome assembly is biased towards intronic, UTR, and intergenic sequences, although most studies focus on protein-coding genes. As an important finding, they describe genotyping error rates ranging from 30% to 83% that, in particular, negatively bias heterozygosity estimates (Figure 1). Their results also show that single contigs are poor expression estimators. Although commonly done in the current gene [...] (Hölzer & Marz, 2019). Nevertheless, modern multitool ensemble approaches for de novo transcriptome assembly achieve promising results (Voshall et al., 2020). However, the implicit assumptions and their violation, as discussed extensively by Freedman et al. (2020), urgently require control mechanisms and corresponding normalization and filter steps, especially with such combined approaches.
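The effect of genotyping errors on heterozygosity mentioned above can be illustrated with a small sketch. It assumes, for illustration only, that genotype calls from a map-to-reference analysis serve as the comparison baseline for calls made on a de novo assembly. The site names and genotypes are invented, and the functions are hypothetical helpers, not the pipeline used by Freedman et al. (2020).

```python
from typing import Dict, Tuple

Genotype = Tuple[str, str]  # unphased diploid genotype, e.g. ("A", "G")


def genotyping_error_rate(baseline: Dict[str, Genotype],
                          calls: Dict[str, Genotype]) -> float:
    """Fraction of shared sites whose unphased genotype differs from baseline."""
    shared = [site for site in baseline if site in calls]
    if not shared:
        raise ValueError("no shared sites to compare")
    errors = sum(1 for site in shared
                 if sorted(baseline[site]) != sorted(calls[site]))
    return errors / len(shared)


def heterozygosity(calls: Dict[str, Genotype]) -> float:
    """Fraction of called sites that are heterozygous."""
    if not calls:
        raise ValueError("no genotype calls")
    return sum(1 for a, b in calls.values() if a != b) / len(calls)


# Hypothetical calls: two true heterozygous sites collapse to homozygous
# in the de novo assembly; site3 differs only in allele order (not an error).
map_to_reference = {
    "site1": ("A", "G"), "site2": ("C", "C"), "site3": ("T", "C"),
    "site4": ("G", "G"), "site5": ("A", "T"),
}
de_novo = {
    "site1": ("A", "A"), "site2": ("C", "C"), "site3": ("C", "T"),
    "site4": ("G", "G"), "site5": ("A", "A"),
}

error_rate = genotyping_error_rate(map_to_reference, de_novo)
het_reference = heterozygosity(map_to_reference)
het_de_novo = heterozygosity(de_novo)
print(error_rate, het_reference, het_de_novo)
```

In this invented data set, heterozygous sites that are miscalled as homozygous both raise the genotyping error rate and lower the heterozygosity estimate, which is the direction of bias described in the text.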
Finally, Freedman et al. (2020) give a brief outlook on the application of long reads derived from single-molecule real-time sequencing (SMRT), as provided, for example, by PacBio or Oxford Nanopore Technologies (ONT), to generate a provisional genome assembly in the absence of a suitable reference genome. Such a draft can then be used for map-to-reference transcriptome analyses. However, other problems may arise, and, as Freedman et al. (2020) describe, genome assembly is not necessarily a panacea for all issues related to expression analysis.
In view of today's technology, one could even argue that the transcriptome assembly of short reads will become obsolete in the coming years. SMRT sequencing can already generate long reads that potentially span full-length transcripts: no assembly required!?
In addition, ONT allows for the direct sequencing of native RNA molecules (dRNA-Seq) without any fragmentation steps or cDNA conversion. Recently, the application of ONT dRNA-Seq to detect differential expression in human cell populations impressively showed the potential of the technology to overcome many limitations of short and long cDNA sequencing methods (Gleeson et al., 2020). However, even if the biases introduced by de novo transcriptome assembly of short reads are avoided completely, not all problems are immediately solved by switching to another technology. Instead, other classes of noise occur, such as a higher sequencing error rate for dRNA-Seq, which researchers need to be aware of and which novel tools must take into account. Thus, hybrid approaches combining the strengths of both short and long reads will become more important, in particular in the context of de novo assembly and transcriptome analyses. In any case, one thing will certainly stay with us: the careful handling of transcriptome data and their interpretation with regard to error, noise, and bias.

ACKNOWLEDGMENTS
Open access funding enabled and organized by Projekt DEAL.