Anne Nitsche and Gero Doose are joint first authors.
Atypical RNAs in the coelacanth transcriptome
Article first published online: 30 OCT 2013
© 2013 Wiley Periodicals, Inc.
Journal of Experimental Zoology Part B: Molecular and Developmental Evolution
Special Issue: Genome of the African Coelacanth
Volume 322, Issue 6, pages 342–351, September 2014
How to Cite
2013. Atypical RNAs in the coelacanth transcriptome. J. Exp. Zool. (Mol. Dev. Evol.) 322B:342–351.
Conflicts of interest: None.
- Issue published online: 6 AUG 2014
- Article first published online: 30 OCT 2013
- Manuscript Accepted: 16 AUG 2013
- Manuscript Revised: 22 JUL 2013
- Manuscript Received: 24 APR 2013
- European Union FP-7 Project QUANTOMICS. Grant Number: 222664
- Federal Ministry of Education and Research in Germany (BMBF) Project ICGC MMML-Seq. Grant Number: 01KU1002J
- Italian Ministry for University and Research. Grant Numbers: UNIVPM6170, UNIVPM6823
Circular and apparently trans-spliced RNAs have recently been reported as abundant types of transcripts in mammalian transcriptome data. Both types of non-colinear RNAs are also abundant in RNA-seq data from different tissues of both the African and the Indonesian coelacanth. We observe more than 8,000 lincRNAs with normal gene structure and several thousand circularized and trans-spliced products, showing that such atypical RNAs contribute substantially to the transcriptome. Surprisingly, the majority of the circularizing and trans-connecting splice junctions are unique to atypical forms, that is, are not used in normal isoforms. J. Exp. Zool. (Mol. Dev. Evol.) 322B: 342–351, 2014. © 2013 Wiley Periodicals, Inc.
Coelacanths, as the most basal branch of the sarcopterygian lineage, have the potential to shed light on many of the features of their ancient relatives. The genomic sequence of the African species, Latimeria chalumnae, has revealed a retarded evolutionary rate for its protein-coding sequences compared to land vertebrates, while most other genomic features evolve at comparable speed. Many salient changes relative to osteichthyan genomes can be attributed to the vertebrate adaptation to land (Amemiya et al., 2013). Here, we will be concerned with global patterns of the coelacanth's transcriptome.
High-throughput transcriptome sequencing provides a view of the RNA content of a sample in unprecedented depth and detail. Although the technology as such promises largely unbiased data, it requires elaborate processing of raw data. It is at this step that preconceptions about what we expect to see in a transcriptome can guide quality control and noise filtering procedures. As a result, these procedures are more often than not prone to ignoring those parts of the data set that do not fit the established paradigms. In this contribution, we therefore focus on this blind spot and investigate in detail those reads that do not map locally and colinearly to their reference genome. In fact, several classes of “atypical” transcripts have been observed in previous studies.
A substantial fraction of the spliced human transcripts from hundreds of loci are circularized (Salzman et al., 2012; Jeck et al., 2013). Some prominent examples of this class have indeed been known for more than a decade (see, e.g., Zaphiropoulos, 1996; Caldas et al., 1998; Surono et al., 1999). Until recently, however, they have received little attention. As no function was known for them, they were usually considered failures of the splicing machinery. Burd et al. (2010), however, reported a specific association of circular isoforms of the ANRIL ncRNA with atherosclerosis risk. RNase R treatment to digest linear RNAs has demonstrated that the circularized products cannot be explained as “RTfacts” (Suzuki et al., 2006; Jeck et al., 2013). Several interesting genes have circular isoforms conserved between human and mouse (Hansen et al., 2013; Jeck et al., 2013; Memczak et al., 2013). Hansen et al. (2013) and Memczak et al. (2013), furthermore, demonstrated that circular transcripts may function as microRNA sponges. Reports of “exon shuffling” (Al-Balool et al., 2011) apparently refer to circular RNAs of which only a part is observed as PCR product. A presumably related, even less well studied phenomenon is exon repetition (Frantz et al., 1999; Rigatti et al., 2004), which also affects hundreds of loci in mammalian genomes (Dixon et al., 2005). Like circularization, it occurs locally within the boundaries of given genes. Local strand-switching has also been observed in a few peculiar cases, such as the mod(mdg4) locus in Drosophila (Dorn et al., 2001), where complex trans-splicing may have evolved in fruitflies as a response to local DNA rearrangements (Labrador and Corces, 2003).
Trans-splicing, that is, the joining of independently transcribed parts, often from distant loci, is a well-known mechanism in organisms such as trypanosomatids or nematodes but has been generally thought not to play a role, or at least not to be a widespread phenomenon, in vertebrates (Zaphiropoulos, 2011). Nevertheless, several conspicuous examples have been known since the end of the last millennium (Caudevilla et al., 1998; Li et al., 1999). Artificial constructs that efficiently undergo trans-splicing are described, for example, by Viles and Sullenger (2008). Large-scale transcriptome sequencing in a variety of species, including human, indicates an abundance of non-colinear RNAs, with pieces deriving from distant genomic regions and even different chromosomes (Gingeras, 2011). Chimeric mRNAs in pigs were studied in detail by Ma et al. (2012), with a focus on comparing expression across individuals and breeds. High tissue specificity of chimeric transcripts is reported by Frenkel-Morgenstern et al. (2012), together with evidence that several of these transcripts give rise to proteins detectable by multiple shotgun mass spectrometry. A database of chimeric transcripts in human, mouse, and fly has recently become available (Frenkel-Morgenstern et al., 2013).
Li et al. (2009) observed that less than 20% of chimeric transcripts have canonical splice sites. About half of the instances instead feature small homologous sequences at the junction. “RTfacts” that are generated by reverse transcription and mimic spliced RNAs during cDNA synthesis, however, also involve short direct repeats (Cocquet et al., 2006; Roy and Irimia, 2008; Houseley and Tollervey, 2010), which are brought into close proximity by complementary base pairing at the ends of the “intron.” A detailed analysis of trans-splicing in fruitflies concludes that most, if not all, apparent chimeric RNA products from distant loci are artifacts, while trans-splicing between homologous alleles occurs frequently (McManus et al., 2010). In contrast, Djebali et al. (2012) observe several statistical signatures of chimeric transcripts in the ENCODE data, in particular a correlation with spatial proximity in vivo, that strongly support a biological origin of these RNAs. Another line of evidence in support of the reality of chimeric transcripts is the observation that coding sequences on fusion transcripts contain complete protein domains significantly more often than in random control data sets (Frenkel-Morgenstern and Valencia, 2012).
We are interested here in particular in the extent to which “atypical” transcripts observed in mammals are also prevalent in coelacanth. To this end we re-analyze transcriptomic data sets for L. chalumnae (Amemiya et al., 2013) and Latimeria menadoensis (Pallavicini et al., 2013). In addition, we use the increased sensitivity of the mapping procedures used here to refine and expand the annotation of splice junctions, leading to the annotation of a large number of novel putative lincRNAs.
In this work four transcriptome data sets have been analyzed. Coelacanth RNA-seq samples were obtained based on liver (SRR576100) and testis tissue (SRR576101) from a single individual of L. menadoensis (Pallavicini et al., 2013) and muscle tissue of a specimen of L. chalumnae (SRR401852) (Amemiya et al., 2013). As reference data sets we downloaded the publicly available muscle RNA-seq data sets from human (SRR545711) and zebrafish (ERR145647) from the sequence read archive. For library sizes see Supplementary Table S2. All data were paired-end reads sequenced with a comparable, non-strand-specific sequencing protocol. The raw reads with length of 101 nt were quality trimmed with FASTX-Toolkit version 0.0.13 (Lab, 2011) and adapter clipped with Cutadapt version 1.2.1 (Martin, 2011). For splice site discovery we mapped all available reads. In order to allow for a direct comparison of the relative abundance of circular and trans-spliced reads we down-sampled the data sets to approximately the same size. In this way, we avoid artifacts that are caused by the use of coverage thresholds for the detection of splice junctions. Otherwise, the number of detected junctions would increase in a poorly controlled manner with the size of the mapped library.
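The text does not say how the down-sampling was performed. As an illustration only, the following minimal sketch draws a fixed-size uniform random subset of read pairs with reservoir sampling, so that libraries of different size end up at the same depth. The function name `reservoir_sample`, the seed, and the toy read identifiers are assumptions made for this example and are not part of the original pipeline.

```python
import random


def reservoir_sample(items, k, seed=0):
    """Return k items drawn uniformly at random from an iterable.

    Hypothetical helper (not from the paper): one pass over the input,
    constant memory in the size of the library, reproducible via `seed`.
    If the input holds fewer than k items, all of them are returned.
    """
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace a stored item with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample


# Toy usage: two "libraries" of different size reduced to the same depth.
library_a = [f"pairA_{i}" for i in range(1000)]
library_b = [f"pairB_{i}" for i in range(250)]
target = 200
sub_a = reservoir_sample(library_a, target, seed=1)
sub_b = reservoir_sample(library_b, target, seed=1)
```

Sampling whole read pairs rather than single mates keeps the paired-end structure intact, which matters for the split-read mapping described in the following section.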
L. chalumnae genome annotation, described in Amemiya et al. (2013), was downloaded from ENSEMBL version 70.
Mapping and Splice Site Detection
We used segemehl version 0.1.4 (Hoffmann et al., 2009; Hoffmann et al., submitted) to map the reads onto the L. chalumnae genome allowing explicitly for split reads. Throughout this contribution we strictly distinguish between splice sites, defined as the genomic positions of a splice donor or splice acceptor, and a splice junction, defined as a pair of donor and acceptor positions spanned by an observed transcript. The splice sites reported by segemehl were then filtered by haarz, a component of the segemehl suite, in order to accumulate high confidence splice sites. To further reduce the chance of mapping artifacts, only splice junctions supported by at least three split reads were kept. Splice sites not included in one of these junctions were also removed from further analysis. The results of the filtering procedure are summarized in Supplementary Table S1.
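The read-support filter described above can be sketched as follows. This is a simplified illustration, not a reimplementation of segemehl or haarz: it assumes that each split read has already been reduced to a (donor, acceptor) pair of genomic positions, and the function name `filter_junctions` and the tuple layout are hypothetical.

```python
from collections import Counter

MIN_SPLIT_READS = 3  # support threshold stated in the text


def filter_junctions(split_reads, min_support=MIN_SPLIT_READS):
    """Keep junctions seen in at least `min_support` split reads.

    split_reads: iterable of (donor, acceptor) pairs, one per split read,
    where each site is a hashable tuple such as (scaffold, position).
    Returns (kept_junctions, kept_sites); sites that occur only in
    discarded junctions are dropped as well.
    """
    support = Counter(split_reads)
    kept = {junction: n for junction, n in support.items() if n >= min_support}
    sites = {site for junction in kept for site in junction}
    return kept, sites


# Toy input: one junction with three supporting reads, one with a single read.
reads = [
    (("s1", 100), ("s1", 500)),
    (("s1", 100), ("s1", 500)),
    (("s1", 100), ("s1", 500)),
    (("s1", 100), ("s1", 900)),
]
junctions, sites = filter_junctions(reads)
```

In the toy input the donor at position 100 survives because it belongs to a well-supported junction, while the acceptor at position 900 is removed together with its single-read junction.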
We determined three types of splice junctions: (1) “normal” junctions with read fragments mapped colinearly with the genomic DNA to the same strand of the same scaffold and an insert size between 15 nt and 50 kb; (2) “circular” junctions on the same strand of the same scaffold with a distance less than 50 kb and with fragment order inverted relative to the genomic DNA; (3) “trans-splicing” junctions, where the two splice sites are located on different scaffolds. The relative orientation is of course irrelevant in this case. Spliced reads that connect two scaffolds can arise from normal, colinear splice events if the scaffolds are short or the splice sites are close to the ends of the scaffolds. In order to avoid contamination from such effects arising from the incompleteness of the genome assembly, we classified reads as trans-spliced only if those reads connect loci at least 50 kb from both ends of at least one of the two involved scaffolds.
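The three junction classes can be written down as a small decision function. The sketch below is one reading of the rules above, under stated assumptions: each end of a split is given as a hypothetical (scaffold, strand, position) tuple in read order, "inverted fragment order" is interpreted as a negative genomic gap along the mapped strand, and junctions that match none of the three definitions are labeled "unclassified" (a label introduced here, not used in the text).

```python
MIN_INSERT = 15          # nt, lower bound for a normal junction
MAX_DISTANCE = 50_000    # nt (50 kb), bound for normal and circular junctions
MIN_END_MARGIN = 50_000  # nt, required distance from both scaffold ends


def far_from_ends(pos, scaffold_len, margin=MIN_END_MARGIN):
    """True if pos lies at least `margin` nt from both scaffold ends."""
    return pos >= margin and scaffold_len - pos >= margin


def classify(a, b, scaffold_len):
    """Classify a split between read fragments ending at a and starting at b.

    a, b: (scaffold, strand, position); scaffold_len: dict of scaffold lengths.
    """
    (sa, stra, pa), (sb, strb, pb) = a, b
    if sa != sb:
        # Different scaffolds: accept as trans only if at least one locus is
        # far from both ends of its scaffold (guards against assembly gaps).
        ok = any(far_from_ends(p, scaffold_len[s]) for s, p in ((sa, pa), (sb, pb)))
        return "trans" if ok else "unclassified"
    if stra != strb:
        return "unclassified"  # strand switch on one scaffold
    # Signed gap along the strand the read maps to.
    gap = pb - pa if stra == "+" else pa - pb
    if MIN_INSERT <= gap <= MAX_DISTANCE:
        return "normal"
    if -MAX_DISTANCE < gap < 0:
        return "circular"
    return "unclassified"


lengths = {"s1": 1_000_000, "s2": 1_000_000, "s3": 60_000}
examples = [
    classify(("s1", "+", 1000), ("s1", "+", 3000), lengths),
    classify(("s1", "+", 3000), ("s1", "+", 1000), lengths),
    classify(("s1", "+", 200_000), ("s2", "-", 500), lengths),
    classify(("s3", "+", 100), ("s2", "+", 10), lengths),
]
```

The last example connects two loci that both sit close to a scaffold end, so it is not counted as trans-spliced even though the scaffolds differ.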
Since a strand-unspecific RNA-seq protocol was used here, the reading direction of spliced reads can only be inferred indirectly. For reads splitting at canonical splice junctions we used MaxEntScan (Yeo and Burge, 2004) scores to compare the two putative reading directions. For both directions we computed the sum of the donor and acceptor score. If one direction had a positive sum, which was greater than the sum of the opposite direction plus 3, we defined this as the correct reading direction.
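The decision rule for the reading direction can be expressed compactly. The sketch below takes the four splice site scores as plain numbers; it does not call MaxEntScan, and the function name `reading_direction` and the example scores are invented for illustration.

```python
def reading_direction(fwd_donor, fwd_acceptor, rev_donor, rev_acceptor, margin=3.0):
    """Choose a reading direction from splice site scores.

    The sum of donor and acceptor score is computed for both putative
    directions. A direction is accepted only if its sum is positive and
    exceeds the sum of the opposite direction by more than `margin`.
    Returns "+", "-", or None if neither direction qualifies.
    """
    fwd = fwd_donor + fwd_acceptor
    rev = rev_donor + rev_acceptor
    if fwd > 0 and fwd > rev + margin:
        return "+"
    if rev > 0 and rev > fwd + margin:
        return "-"
    return None


# Hypothetical scores: clear forward signal, ambiguous case, clear reverse signal.
calls = [
    reading_direction(8.5, 7.0, -12.0, -3.0),
    reading_direction(2.0, 1.5, 1.0, 1.0),
    reading_direction(-5.0, -4.0, 6.0, 5.0),
]
```

The margin of 3 keeps junctions with two similarly plausible directions undetermined instead of forcing a call.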
Variation within and between the two coelacanth species was quantified by determining SNPs of mapped transcriptome at all sites with a coverage of at least 8 reads. We used GATK version 2.3 (De Pristo et al., 2011) for SNP calling.
Circular Motif Search
In order to find a putative motif that is predominantly associated with circular junctions, we extracted 6 nt of DNA sequence at each splice site (3 nt in exon and intron) and combined it to form a 12 nt sequence pattern for each splice junction. This results in 5,561 unique patterns for 5,760 circular splice junctions and 27,311 unique patterns for 213,417 normal splice junctions. We employed MEME (Bailey and Elkan, 1995) for the motif search. It was run with the “zero or one match per sequence” option. As expected, the canonical splice junction motif was readily recovered. After removing about 700 twelve-mers that conform to a canonical or minor spliceosome motif, we again started the MEME motif search to see if any additional characteristic patterns could be detected. This was not the case.
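The construction of the 12 nt patterns can be illustrated as follows. This is a minimal sketch for junctions on the forward strand only; the coordinate convention (0-based index of the first intronic base at the donor and of the first exonic base at the acceptor), the function name `junction_pattern`, and the toy sequence are assumptions of this example. The resulting set of unique patterns is the kind of input that would then be passed to a motif finder such as MEME.

```python
def junction_pattern(seq, donor_pos, acceptor_pos, flank=3):
    """Build the pattern for one junction: 3 nt exon + 3 nt intron at the
    donor, followed by 3 nt intron + 3 nt exon at the acceptor.

    donor_pos: index of the first intronic base after the donor exon.
    acceptor_pos: index of the first exonic base after the acceptor.
    The two positions are handled independently, so the same function
    works for colinear and for circularizing junctions.
    """
    for pos in (donor_pos, acceptor_pos):
        if pos < flank or pos + flank > len(seq):
            raise ValueError("splice site too close to the sequence end")
    donor = seq[donor_pos - flank:donor_pos + flank]
    acceptor = seq[acceptor_pos - flank:acceptor_pos + flank]
    return donor + acceptor


# Toy locus: exon, intron with GT...AG boundaries, exon.
exon1, intron, exon2 = "ATGCAG", "GTAAGTCCCTTTCAG", "GCTTAA"
locus = exon1 + intron + exon2
pattern = junction_pattern(locus, len(exon1), len(exon1) + len(intron))

# Identical patterns from different junctions collapse into one entry.
unique_patterns = {pattern, junction_pattern(locus, 6, 21)}
```

Collapsing patterns into a set mirrors the reduction from junction counts to unique pattern counts reported above.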
SHSs and RTfacts
We analyzed to which extent the splice junctions of the data sets can be explained by the short homologous sequences model proposed in Li et al. (2009) or by RT PCR artifacts that show similar sequence homology (cf. Houseley and Tollervey, 2010). We thus computed the maximal length of the homologous subsequences between the exonic regions of the donor and acceptor splice sites. An exact overlap of at least 4 nt was counted as “short homologous sequence” (SHS), which may indicate an RTfact.
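One way to read the SHS criterion is as a longest-common-substring test on the exonic flanks of the donor and acceptor. The sketch below follows that reading; it is an assumption about the procedure rather than the original code, and the flank sequences and function names are hypothetical.

```python
def longest_common_substring(a, b):
    """Length of the longest exact substring shared by a and b.

    Standard dynamic programming over suffix matches, keeping only the
    previous row: O(len(a) * len(b)) time, O(len(b)) memory.
    """
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0] * (len(b) + 1)
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def is_shs(donor_flank, acceptor_flank, min_len=4):
    """Flag a junction whose exonic flanks share >= min_len identical nt."""
    return longest_common_substring(donor_flank, acceptor_flank) >= min_len


# Hypothetical exonic flanks: the first pair shares "GTTGC" (5 nt).
shared = longest_common_substring("ACGTTGCA", "TTGTTGCC")
flagged = is_shs("ACGTTGCA", "TTGTTGCC")
clean = is_shs("AAACCC", "GGGTTT")
```

Junctions flagged in this way are candidates for template-switching artifacts and would be counted separately from junctions with canonical splice motifs.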
Coverage Estimation for Splice Junctions
In order to investigate the relationships between RNA expression and abundance of spliced RNA reads, we defined “coverage loci” as follows: We considered genomic regions with a minimum coverage of 8 reads and merged sites separated by less than 100 nt. Sites smaller than 50 nt were removed from further analysis. To account for inaccuracies in determining the boundaries of loci, we counted all spliced reads with a splice junction within 50 nt of a “coverage locus.”
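The definition of "coverage loci" amounts to a thresholding step followed by interval merging and a length filter. The following sketch assumes a per-base coverage list for a single scaffold and half-open intervals; the function names and the toy coverage profile are illustrative only.

```python
def coverage_loci(cov, min_cov=8, max_gap=100, min_len=50):
    """Return loci as (start, end) half-open intervals.

    1. Collect runs of positions with coverage >= min_cov.
    2. Merge runs separated by fewer than max_gap uncovered nt.
    3. Drop merged loci shorter than min_len nt.
    """
    runs, start = [], None
    for i, c in enumerate(cov):
        if c >= min_cov and start is None:
            start = i
        elif c < min_cov and start is not None:
            runs.append([start, i])
            start = None
    if start is not None:
        runs.append([start, len(cov)])

    merged = []
    for s, e in runs:
        if merged and s - merged[-1][1] < max_gap:
            merged[-1][1] = e  # close enough: extend the previous locus
        else:
            merged.append([s, e])
    return [(s, e) for s, e in merged if e - s >= min_len]


def near_locus(pos, locus, slack=50):
    """True if a splice junction position lies within `slack` nt of a locus."""
    s, e = locus
    return s - slack <= pos < e + slack


# Toy profile: two nearby covered runs, a long gap, then a short covered run.
profile = [0] * 10 + [9] * 40 + [0] * 20 + [12] * 30 + [0] * 200 + [8] * 20 + [0] * 5
loci = coverage_loci(profile)
```

In the toy profile the first two runs are merged into one locus of 90 nt, while the isolated 20 nt run is discarded by the length filter.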
Transcriptome Reconstruction and Identification of Novel lincRNAs
We used cufflinks version 2.0.2 (Roberts et al., 2011) to reconstruct possible transcripts together with their isoforms. The mapping output of segemehl was modified to fit the input requirements of cufflinks using a custom script. Separate transcript assemblies for both the complete L. chalumnae data set and the combined L. menadoensis data sets were merged together with cuffmerge as proposed by Trapnell et al. (2012). Overlaps between transcript and annotation data were computed with the help of BEDTools (Quinlan and Hall, 2010). In order to predict the coding potential of transcripts that were located at unannotated regions we applied RNAcode (Washietl et al., 2011) to the coelacanth-centric multiple alignment described in Amemiya et al. (2013). Transcripts were classified as potentially coding if at least half of their exons showed a minimum overlap of 50% with potentially coding regions. Transcripts that did not overlap with potentially coding regions were classified as potentially new lincRNAs. To confirm these lincRNA candidates they were compared against the non-redundant protein data base version (March 03, 2012) with tblastx (Altschul et al., 1997). Candidates that showed significant alignment hits were added to the potentially coding class. We operationally combined transcripts with the same reading direction separated by less than 5 kb into a single locus to account for the fact that many lincRNAs have rather low expression levels and thus may not be fully covered.
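The exon-overlap rule used to separate potentially coding transcripts from lincRNA candidates can be sketched as below. The code assumes half-open (start, end) intervals on one scaffold and strand; it stands in for the BEDTools overlap computation and does not run RNAcode or tblastx. The label "unclassified" for transcripts that overlap coding regions without meeting the threshold is introduced here for completeness and is not a category named in the text.

```python
def overlap_fraction(exon, regions):
    """Fraction of exon bases covered by the union of `regions`."""
    s, e = exon
    covered, cursor = 0, s
    for rs, re_ in sorted(regions):
        lo, hi = max(rs, cursor), min(re_, e)
        if hi > lo:
            covered += hi - lo
            cursor = hi  # avoid counting overlapping regions twice
    return covered / (e - s)


def classify_transcript(exons, coding_regions, min_exon_overlap=0.5, min_exon_share=0.5):
    """Apply the rule: potentially coding if at least half of the exons
    overlap potentially coding regions by at least 50% of their length;
    lincRNA candidate if there is no overlap at all."""
    fractions = [overlap_fraction(x, coding_regions) for x in exons]
    if all(f == 0 for f in fractions):
        return "lincRNA candidate"
    hits = sum(f >= min_exon_overlap for f in fractions)
    if hits >= min_exon_share * len(exons):
        return "potentially coding"
    return "unclassified"


# Toy transcript with three exons and different sets of coding regions.
exons = [(100, 200), (300, 400), (500, 600)]
labels = [
    classify_transcript(exons, [(100, 180), (320, 400)]),
    classify_transcript(exons, []),
    classify_transcript(exons, [(100, 110)]),
]
```

In the first case two of three exons are covered to 80%, which satisfies both thresholds; in the last case the overlap is too small to call the transcript coding but rules it out as a clean lincRNA candidate.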
Conservation of Splice Sites
In order to obtain evidence for the conservation of gene structure we used the coelacanth-centered multiple sequence alignment (Amemiya et al., 2013) to find the exact sequence positions homologous to the splice sites in Latimeria. We then determined whether an experimentally known splice site is annotated in RefSeq or an EST data set at this position for any of the other eight aligned species (human, mouse, dog, opossum, lizard, stickleback, frog, or chicken). In the absence of experimental evidence we used MaxEntScan (Yeo and Burge, 2004). Following Nitsche et al. (in preparation), we used a score cutoff >3 for predicted splice sites.
The first-strand cDNA and genomic DNA from muscle of L. menadoensis was amplified by thermal cycling using the Takara ExTaq PCR kit (Takara, Japan). Primer pairs were designed to generate a PCR product that spanned the fusion site for these transcripts. Additional control primers were designed to amplify sequences present only in the local genomic contexts. Amplification was performed for 30 cycles at 94°C for 15 sec, 55°C for 30 sec, and 72°C for 1 min, with a final elongation for 8 min at 72°C. The amplified PCR products were sub-cloned into the pCRII-TOPO dual promoter vector (Invitrogen, Grand Island, NY, USA) and sequenced. Primer sequences and PCR products are listed in the Online Supplementary Tables S7 and S8.
Mapping and Variation
Between 75% and 80% of the reads in the individual RNA-seq data sets could be mapped to the reference genome. Between 1/6 and 1/5 of these mapped with splits, see Supplementary Table S2. In addition, sub-sampled libraries were mapped to obtain comparable sample sizes for quantifying circular and trans-spliced reads, see Supplementary Table S2.
The RNA-seq libraries of L. chalumnae muscle tissue and L. menadoensis liver and testis were of comparable size and quality, covering slightly more than 1% of the genome assembly. Using these transcriptome data as a reference, the two Latimeria species were very similar. The L. menadoensis transcripts showed only about 0.3% divergence from the L. chalumnae reference genome, while the number of heterozygous SNPs, that is, the intra-specific variation in L. menadoensis, was about twice as large. The number of homozygous differences between transcriptome and reference genome barely exceeded 0.1% and was consistent with about 0.4% heterozygous SNPs in L. chalumnae RNA-seq data (Table 1). The small divergence relative to the intra-specific diversity justified a joint analysis of all coelacanth transcriptome data in the following.
Splice Junctions and Transcripts
We made use of the enhanced sensitivity of segemehl in mapping split reads to extend the ENSEMBL 70.1 gene build for the LatCha1 assembly. The extreme similarity between the two coelacanth species, comparable to that between human and chimp, justified combining the RNA-seq data for the purpose of constructing transcript models.
For the Latimeria chalumnae (muscle complete) RNA-seq data, 26,176,970 reads were mapped with local, colinear splits. For the L. menadoensis (testis and liver) data, 14,201,048 reads were mapped. For the union of these sets, 12,817,375 normal split reads that satisfied our filtering criteria were retained. Although the RNA-seq data had been produced with a non-strand-specific protocol, the reading direction could be determined with the help of MaxEntScan (Yeo and Burge, 2004) for 98.8% of these reads based on the canonical splice site motifs. This resulted in 270,957 unique splice sites, of which 208,956 exactly matched the splice sites of the ENSEMBL 70.1 gene build for the LatCha1 assembly (Fig. 1, Top). About 43% of the ENSEMBL splice junctions were not visible in our transcriptome map because the corresponding genes were not expressed at sufficient levels to pass our filtering criteria in the three tissues considered here. Additionally, 1,793 sites matched splice junctions from the lincRNA set reported in the Supplementary Materials (Supplementary Data 1) of the coelacanth genome paper (Amemiya et al., 2013). Another 17,801 mapped to novel splice junctions within the boundaries of genes annotated in ENSEMBL 70.1 in the correct reading direction. Since they did not match exactly the positions of annotated splice sites of ENSEMBL 70.1, they are not shown in Figure 1 (Top). This left 42,463 novel splice sites located outside annotated genes, corresponding to 22,424 distinct splice junctions that are located entirely outside of annotation. Furthermore, we identified 3,360 distinct junctions with only one side outside the published annotation. A detailed comparison of observed splice junctions is compiled in Supplementary Tables S3 and S4; a graphical summary of the splice sites accounting for the exact matches only is given in Figure 1 (Top).
Assembled into transcripts with cufflinks and cuffmerge, these combined transcriptome data of L. chalumnae and L. menadoensis encompassed 126,235 distinct transcripts belonging to 109,761 genes. This amounts to an average of 2.54 exons. Of these, 86,203 (68.3%) transcripts were intronless. 61.9% of the transcripts (69,434) did not contain exons located within gene boundaries annotated by ENSEMBL. The majority of these, namely 58,058 transcripts, were intronless.
About 87% (60,444) of these new transcripts can be considered as lincRNAs since they have no overlaps with RNAcode hits or blastx hits in the CCDS database with an E-value <10^−10. About 18% of the rest, that is, 1,586 new transcripts, can be classified as potentially coding genes, since at least half of their exons overlap by at least 50% of their sequence with blastx alignments or with regions found by RNAcode. If strand information was available, the overlap had to be strand-specific.
We found 22,424 novel splice junctions outside the published annotation corresponding to 41,139 unique splice sites. Of these, 32,467 matched exactly with splice sites in the collated transcript models produced by cuffmerge. 4,163 additional splice sites were located within these transcripts, apparently corresponding to local variations in the exact splicing position.
It should be noted that a substantial fraction of splice sites from the raw data were not incorporated into transcript models by cufflinks. This explains, for example, why part of the splice sites in the lincRNAs annotated in Amemiya et al. (2013) are not recovered in our analysis.
An overview of the transcriptome analysis relative to the previously available annotation is given in Figure 1 (Bottom), where transcripts were merged into loci according to a 5 kb window. Overall, we report here 50,644 novel expressed loci that were overlooked in previous analyses of the same data sets. Of these, 30,268 contain spliced transcripts. The vast majority of newly identified transcripts is non-coding. Nevertheless, we were able to identify more than 500 additional loci with coding capacity.
Conservation of Transcript Structure
Of the 270,957 canonical splice sites in the combined data set, which includes 208,956 sites matching to ENSEMBL annotation (Supplementary Table S3), about 77.8% were alignable in at least one of eight other vertebrate genomes. More than 96% of these were conserved according to splice site scores, and for 92.7% there was experimental evidence for a functional splice site in at least one of these eight species (Supplementary Table S5). The overwhelming majority of these splice sites were located within protein-coding genes.
We observed 23,065 splice sites in 8,066 spliced lincRNAs in the union of our lincRNAs and the lincRNAs reported in the coelacanth genome paper (Amemiya et al., 2013). About 14% of the splice sites (1,839 sites in 1,135 transcripts) in this combined lincRNA set were alignable to sequence in at least one of the other eight vertebrate genomes included in the latimeria-centered MSA (Table 2). Of these, 40% exactly correspond to an annotated splice site in at least one of these species, providing direct evidence for the partial conservation of 301 lincRNA loci (merged from 391 transcripts).
Table 2. Conservation of splice sites in the combined lincRNA set.
| H or M | 540 | 292 | 374 | 166 |
The rather poor conservation of lincRNAs as measured by splice sites does not come as a surprise, since only a small fraction of the observed splice junctions were included in the multiple sequence alignments in the first place. Their level of sequence conservation was very low compared to other functional transcripts (Pang et al., 2006; Marques and Ponting, 2009), although there is good evidence that, at least as a group, mRNA-like non-coding RNAs are under stabilizing selection (Ponjavic et al., 2007; Guttman et al., 2009; Marques and Ponting, 2009; Young et al., 2012).
For L. menadoensis and L. chalumnae we observed 5,760 circularizing junctions and 17,066 trans-splicing junctions. For a fraction of 10.6% and 28.7%, respectively, we were able to determine a reading direction, based on canonical splice motifs. Thus 610 circular junctions remain, consisting of 1,120 canonical splice sites. Almost half of these splice sites (501) are also utilized in regular, colinear splice junctions. They are surprisingly well conserved: more than 60% are located in a region that is alignable in at least one other distant vertebrate and more than a third of these positions constitute a functional splice site according to the available experimental evidence, see Table 3. A comparison of circularizing splice junctions with recent reports of circular microRNA sponges in the human transcriptome (Hansen et al., 2013; Memczak et al., 2013) did not provide evidence for the conservation of these particular RNAs between mammals and coelacanth, however.
Table 3. Conservation of splice sites in circularizing junctions.
| Species | No. of unique splice sites |
| H or M | 296 | 126 | 115 | 147 |
In the combined Latimeria RNA-seq data we found 17,066 trans-splice junctions connecting different scaffolds. Among these are 338 that are backed by more than 100 split reads. The majority of these splice sites were unique to trans-splicing events.
Table 4 summarizes the conservation of the trans-splicing sites. Only a third of them could be aligned to homologous sequences in other vertebrates. In most of these cases we observed a functional splice site in the other species. However, in general, the specificity for non-local junctions does not appear to be as conserved across species as other splice sites.
Table 4. Conservation of trans-splicing sites.
| Species | No. of unique splice sites |
| H or M | 2,023 | 1,738 | 1,746 | 1,781 |
While circular transcripts have been validated in several studies and we are beginning to understand their regulatory functions, it seems that the reality of large numbers of trans-splicing-like RNAs is still not universally accepted. We therefore selected three examples for direct validation by PCR, see Figure 2 for details.
Comparison of Normal and Atypical Transcripts
For a better comparison of the properties of circular and trans-spliced transcripts in the individual data sets we used sub-samples of equal size. In this way we obtained a comparable sequencing depth, which should at least alleviate the biases arising from very rare junctions in the largest data sets. While this simple normalization cannot account for differences in the expression profiles of the different tissues it should at least make the data sets qualitatively comparable.
Results for the two coelacanth species are very similar, as shown in Supplementary Figure S1; hence we use their union. Figure 3 compares the atypical reads with normal (local and colinear) splice events for coelacanth, human, and zebrafish RNA libraries. As expected, the overwhelming majority of normal splice events utilize canonical splice patterns. In contrast, circular and trans-splice (long-range) events often use alternative sequence patterns, although a substantial fraction still conforms to the canonical motifs. We observed that in the coelacanth data, more circularizing splice junctions are off by 1 or 2 nt compared to both the human and the zebrafish data set. Adding these to the canonical subset (distance 0 in the left-hand side panels of Fig. 3) yielded nearly the same fraction of about 70–80% canonical splice motifs as in zebrafish and human. We note that this fraction is substantially larger than the numbers reported in Li et al. (2009). Surprisingly, however, the right-most two columns of Figure 3 show that most of the circularizing and trans-joining splice junctions are disjoint from normal splice junctions. This effect is even more pronounced in coelacanth and zebrafish than in human. This pattern, which we observe for both the circularizing and the trans-junctions, strongly suggests that the resulting unconventional transcripts are not merely a by-product of conventional, local splicing events.
Since a substantial fraction of the circular and trans-splice junctions did not fit the canonical splice site motif, we searched for additional over-represented patterns in the remaining junctions. No significant pattern could be identified, however. We then searched for the “short homologous sequences,” that is, short sequences with four or more nucleotides, shared by the sequences surrounding the “splice junction.” According to Houseley and Tollervey (2010), however, these might be RTfacts. Supplementary Table S6 shows that such patterns are rare in our data, ranging from 0.7% to 2.6% of the circularized or trans-spliced transcripts. At the same time, the majority of atypical junctions are associated with canonical splice site motifs. We thus conclude that contamination levels in our data are low and the majority of both circular and trans-spliced RNAs cannot be explained as technical artifacts.
In Figure 4, we summarize the correlation between the coverage of a given locus and the abundance of splice events. For the normal splice events we clearly observed two populations. Along the x-axis we record a large number of rare splice junctions whose occurrence shows no correlation with the expression level. Along the main diagonal, on the other hand, efficiently spliced transcripts are recorded. Here the number of spliced reads essentially equals the coverage of the locus. For circular and trans-spliced loci we also observed a separation into two populations. The number of circular and trans-spliced reads, however, is typically only a fraction of the coverage of the locus. Most circular and trans-spliced transcripts thus originated from loci that also produce more conventional isoforms. Interestingly, measured in terms of RNA levels, “aberrant” isoforms were much more abundant in coelacanths than in zebrafish and human.
Atypical transcripts, defined here as those that do not map locally and colinearly to the genomes, have only very recently been recognized as a major component of the transcriptome. We analyze here in detail the available RNA-seq data of two coelacanth species, L. chalumnae and L. menadoensis. With the help of an improved mapping algorithm implemented in segemehl (Hoffmann et al., submitted) that deals efficiently with both typical and atypical transcripts, we give here a detailed overview of the coelacanth transcriptome. In particular we report 51,488 additional expressed loci from which normal transcripts arise (576 protein-coding and 37,099 lincRNAs), together with 362 splice sites of circular RNAs and 4,698 of long-range (trans-spliced) connections. The very high fraction of junctions that use canonical splice sites strongly suggests that the overwhelming majority of these transcripts cannot be explained as “RTfacts” and must be interpreted as biological reality.
A particularly interesting feature is the unexpectedly high level of evolutionary conservation of splice sites involved in circularization. The majority of these sites do not appear in normal transcripts. Their conservation suggests that they are of functional importance. Recent reports of abundant, stable, and often conserved circular RNAs in mammals have identified them as an important class of regulatory molecules (Hansen et al., 2013; Jeck et al., 2013; Memczak et al., 2013). Our results strongly indicate that such “atypical” transcripts are evolutionarily old, dating back at least to an osteichthyan ancestor. Non-local trans-spliced transcripts are even less well understood. The statistical similarities in splice site usage and conservation between trans-spliced and circularized products suggest that at least a subset is also functional. This observation is further strengthened by evolutionary conservation of at least a fraction of the non-local trans-splice sites, albeit a smaller fraction than with the circularizing sites. Future exploration of the functional significance of “atypical” transcripts such as these promises to yield many new insights.
LIFE—Leipzig Research Center for Civilization Diseases, Universität Leipzig is funded by means of the European Social Fund and the Free State of Saxony.
- 2011. Post-transcriptional exon shuffling events in humans can be evolutionarily conserved and abundant. Genome Res 21:1788–1799.
- 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402.
- 2013. The African coelacanth genome provides insights into tetrapod evolution. Nature 496:311–316.
- 1995. Unsupervised learning of multiple motifs in biopolymers using expectation maximization. Mach Learn 21:51–80.
- 2010. Expression of linear and novel circular forms of an INK4/ARF-associated non-coding RNA correlates with atherosclerosis risk. PLoS Genet 6:e1001233.
- 1998. Exon scrambling of MLL transcripts occur commonly and mimic partial genomic duplication of the gene. Gene 208:167–176.
- 1998. Natural trans-splicing in carnitine octanoyltransferase pre-mRNAs in rat liver. Proc Natl Acad Sci USA 95:12185–12190.
- 2006. Reverse transcriptase template switching and false alternative transcripts. Genomics 88:127–131.
- 2011. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 43:491–498.
- 2005. A genome-wide survey demonstrates widespread non-linear mRNA in expressed sequences from multiple species. Nucleic Acids Res 33:5904–5913.
- 2012. Evidence for transcript networks composed of chimeric RNAs in human cells. PLoS ONE 7:e28213.
- 2001. Transgene analysis proves mRNA trans-splicing at the complex mod(mdg4) locus in Drosophila. Proc Natl Acad Sci USA 98:9724–9729.
- 1999. Exon repetition in mRNA. Proc Natl Acad Sci USA 96:5400–5405.
- 2012. Novel domain combinations in proteins encoded by chimeric transcripts. Bioinformatics 28:i67–i74.
- 2012. Chimeras taking shape: potential functions of proteins encoded by chimeric RNA transcripts. Genome Res 22:1231–1242.
- 2013. ChiTaRS: a database of human, mouse and fruit fly chimeric transcripts and RNA-sequencing data. Nucleic Acids Res 41:D142–D151.
- 2009. Implications of chimaeric non-co-linear transcripts. Nature 461:206–211.
- 2009. Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals. Nature 458:223–227.
- 2013. Natural RNA circles function as efficient microRNA sponges. Nature 495:384–388.
- 2009. Fast mapping of short sequences with mismatches, insertions and deletions using index structures. PLoS Comput Biol 5:e1000502.
- 2010. Apparent non-canonical trans-splicing is generated by reverse transcriptase in vitro. PLoS ONE 5:e12271.
- 2013. Circular RNAs are abundant, conserved, and associated with ALU repeats. RNA 19:141–157.
- 2011. FASTX Toolkit. Available online at: http://hannonlab.cshl.edu/fastx_toolkit/ [accessed 01/02/2013].
- 2003. Extensive exon reshuffling over evolutionary time coupled to trans-splicing in Drosophila. Genome Res 13:2220–2228.
- 1999. Human acyl-CoA:cholesterol acyltransferase-1 (ACAT-1) gene organization and evidence that the 4.3-kilobase ACAT-1 mRNA is produced from two different chromosomes. J Biol Chem 274:11060–11071.
- 2009. Short homologous sequences are strongly associated with the generation of chimeric RNAs in eukaryotes. J Mol Evol 68:56–65.
- 2012. Identification and analysis of pig chimeric mRNAs using RNA sequencing data. BMC Genomics 13:429.
- 2009. Catalogues of mammalian long noncoding RNAs: modest conservation and incompleteness. Genome Biol 10:R124.
- 2011. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17(1). Available online at: http://journal.embnet.org/index.php/embnetjournal/article/view/200
- 2010. Global analysis of trans-splicing in Drosophila. Proc Natl Acad Sci USA 107:12975–12979.
- 2013. Circular RNAs are a large class of animal RNAs with regulatory potency. Nature 495:333–338.
- 2013. Analysis of the transcriptome of the Indonesian coelacanth Latimeria menadoensis. BMC Genomics 14:538.
- 2006. Rapid evolution of noncoding RNAs: lack of conservation does not mean lack of function. Trends Genet 22:1–5.
- 2007. Functionality or transcriptional noise? Evidence for selection within long noncoding RNAs. Genome Res 17:556–565.
- 2010. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26:841–842.
- 2004. Exon repetition: a major pathway for processing mRNA of some genes is allele-specific. Nucleic Acids Res 32:441–446.
- 2011. Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics 27:2325–2329.
- 2008. When good transcripts go bad: artifactual RT-PCR ‘splicing’ and genome analysis. Bioessays 30:601–605.
- 2012. Circular RNAs are the predominant transcript isoform from hundreds of human genes in diverse cell types. PLoS ONE 7:e30733.
- 1999. Circular dystrophin RNAs consisting of exons that were skipped by alternative splicing. Hum Mol Genet 8:493–500.
- 2006. Characterization of RNase R-digested cellular RNA source that consists of lariat and circular RNAs from pre-mRNA splicing. Nucleic Acids Res 34:e63.
- 2012. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc 7:562–578.
- 2008. Proximity-dependent and proximity-independent trans-splicing in mammalian cells. RNA 14:1081–1094.
- 2011. RNAcode: robust discrimination of coding and noncoding regions in comparative sequence data. RNA 17:578–594.
- 2004. Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals. J Comput Biol 11:377–394.
- 2012. Identification and properties of 1,119 candidate lincRNA loci in the Drosophila melanogaster genome. Genome Biol Evol 4:427–442.
- 1996. Circular RNAs from transcripts of the rat cytochrome P450 2C24 gene: correlation with exon skipping. Proc Natl Acad Sci USA 93:6536–6541.
- 2011. Trans-splicing in higher eukaryotes: implications for cancer development? Front Genet 2:92.
Additional supporting information may be found in the online version of this article at the publisher's web-site.
Figure S1. Comparison of splice junctions in normal (local and co-linear), circular, and trans-spliced reads. The first three columns of the panel summarize the distribution of distances of predicted splice junctions from the closest site with a canonical donor or acceptor motif (gt–ag). The remaining two columns show the distances to the nearest site that also harbours normally spliced reads.
Table S1. Split numbers.
Table S2. Summary of mapped reads.
Table S3. Comparison of observed splice junctions in L. chalumnae and L. menadoensis, resulting from reads mapped with local, colinear splits. The addition “filtered” describes the filtering by haarz mapping criteria and a minimum junction support of three reads.
Table S4. Relation of splice junctions to annotation.
Table S5. Conservation of normal splice sites.
Table S6. SHS motifs in circular and trans-spliced RNAs are indicative of “RTfacts.”
Table S7. Primers used to validate putative fused transcripts.
Table S8. Sequences for subcloned PCR products from validation experiment.