ABSTRACT

  1. Top of page
  2. ABSTRACT
  3. METHODS
  4. RESULTS
  5. DISCUSSION
  6. ACKNOWLEDGMENTS
  7. LITERATURE CITED
  8. Supporting Information

Circular and apparently trans-spliced RNAs have recently been reported as abundant types of transcripts in mammalian transcriptome data. Both types of non-colinear RNAs are also abundant in RNA-seq data from different tissues of both the African and the Indonesian coelacanth. We observe more than 8,000 lincRNAs with normal gene structure and several thousand circularized and trans-spliced products, showing that such atypical RNAs make a substantial contribution to the transcriptome. Surprisingly, the majority of the circularizing and trans-connecting splice junctions are unique to atypical forms, that is, are not used in normal isoforms. J. Exp. Zool. (Mol. Dev. Evol.) 322B: 342–351, 2014. © 2013 Wiley Periodicals, Inc.

Coelacanths, as the most basal branch of the sarcopterygian lineage, have the potential to shed light on many of the features of their ancient relatives. The genomic sequence of the African species, Latimeria chalumnae, has revealed a retarded evolutionary rate for its protein-coding sequences compared to land vertebrates, while most other genomic features evolve at comparable speed. Many salient changes relative to osteichthyan genomes can be attributed to the vertebrate adaptation to land (Amemiya et al., 2013). Here, we will be concerned with global patterns of the coelacanth's transcriptome.

High throughput transcriptome sequencing provides a view on the RNA content of a sample in unprecedented depth and detail. Although the technology as such promises largely unbiased data, it requires elaborate processing of raw data. It is at this step that preconceptions about what we expect to see in a transcriptome can guide quality control and noise filtering procedures. As a result, these procedures tend to ignore those parts of the data set that do not fit the established paradigms. In this contribution, we therefore focus on this blind spot and investigate in detail those reads that do not map locally and colinearly to their reference genome. In fact, several classes of “atypical” transcripts have been observed in previous studies.

A substantial fraction of the spliced human transcripts from hundreds of loci are circularized (Salzman et al., 2012; Jeck et al., 2013). Some prominent examples of this class have indeed been known for more than a decade (see, e.g., Zaphiropoulos, 1996; Caldas et al., 1998; Surono et al., 1999). Until recently, however, they have received little attention. As no function was known for them, they were usually considered failures of the splicing machinery. Burd et al. (2010), however, reported a specific association of circular isoforms of the ANRIL ncRNA with atherosclerosis risk. RNase R treatment to digest linear RNAs has demonstrated that the circularized products cannot be explained as “RTfacts” (Suzuki et al., 2006; Jeck et al., 2013). Several interesting genes have circular isoforms conserved between human and mouse, and circular transcripts may furthermore function as microRNA sponges (Hansen et al., 2013; Jeck et al., 2013; Memczak et al., 2013). Reports of “exon shuffling” (Al-Balool et al., 2011) apparently refer to circular RNAs of which only a part is observed as PCR product. A presumably related, even less well-studied phenomenon is exon repetition (Frantz et al., 1999; Rigatti et al., 2004), which also affects hundreds of loci in mammalian genomes (Dixon et al., 2005). Like circularization, it occurs locally within the boundaries of given genes. Local strand-switching has also been observed in a few peculiar cases, such as the mod(mdg4) locus in Drosophila (Dorn et al., 2001), where complex trans-splicing may have evolved in fruitflies as a response to local DNA rearrangements (Labrador and Corces, 2003).

Trans-splicing, that is, the joining of independently transcribed parts, often from distant loci, is a well-known mechanism in organisms such as trypanosomatids or nematodes but has been generally thought not to play a role, or at least not to be a widespread phenomenon, in vertebrates (Zaphiropoulos, 2011). Nevertheless, several conspicuous examples have been known since the end of the last millennium (Caudevilla et al., 1998; Li et al., 1999). Artificial constructs that efficiently undergo trans-splicing are described, for example, by Viles and Sullenger (2008). Large-scale transcriptome sequencing in a variety of species, including human, indicates an abundance of non-colinear RNAs, with pieces deriving from distant genomic regions and even different chromosomes (Gingeras, 2011). Chimeric mRNAs in pigs were studied in detail by Ma et al. (2012), with a focus on comparing expression across individuals and breeds. High tissue specificity of chimeric transcripts is reported by Frenkel-Morgenstern et al. (2012) together with evidence that several of these transcripts give rise to proteins detectable by multiple shotgun mass spectrometry. A data base of chimeric transcripts in human, mouse, and fly has recently become available (Frenkel-Morgenstern et al., 2013).

Li et al. (2009) observed that less than 20% of chimeric transcripts have canonical splice sites. About half of the instances instead feature small homologous sequences at the junction. “RTfacts” that are generated by reverse transcription and mimic spliced RNAs during cDNA synthesis, however, also involve short direct repeats (Cocquet et al., 2006; Roy and Irimia, 2008; Houseley and Tollervey, 2010), which are brought into close proximity by complementary base pairing at the ends of the “intron.” A detailed analysis of trans-splicing in fruitflies concludes that most, if not all, apparent chimeric RNA products from distant loci are artifacts, while trans-splicing between homologous alleles occurs frequently (McManus et al., 2010). In contrast, Djebali et al. (2012) observe several statistical signatures of chimeric transcripts in the ENCODE data, in particular a correlation with spatial proximity in vivo that strongly support a biological origin of these RNAs. Another line of evidence in support of the reality of chimeric transcripts is the observation that coding sequences on fusion transcripts contain complete protein domains significantly more often than in random control data sets (Frenkel-Morgenstern and Valencia, 2012).

We are interested here in particular in the extent to which “atypical” transcripts observed in mammals are also prevalent in coelacanth. To this end we re-analyze transcriptomic data sets for L. chalumnae (Amemiya et al., 2013) and Latimeria menadoensis (Pallavicini et al., 2013). In addition, we use the increased sensitivity of the mapping procedures used here to refine and expand the annotation of splice junctions, leading to the annotation of a large number of novel putative lincRNAs.

METHODS

Data Sets

In this work four transcriptome data sets have been analyzed. Coelacanth RNA-seq samples were obtained based on liver (SRR576100) and testis tissue (SRR576101) from a single individual of L. menadoensis (Pallavicini et al., 2013) and muscle tissue of a specimen of L. chalumnae (SRR401852) (Amemiya et al., 2013). As reference data sets we downloaded the publicly available muscle RNA-seq data sets from human (SRR545711) and zebrafish (ERR145647) from the sequence read archive. For library sizes see Supplementary Table S2. All data were paired-end reads sequenced with a comparable, non-strand-specific sequencing protocol. The raw reads with length of 101 nt were quality trimmed with FASTX-Toolkit version 0.0.13 (Lab, 2011) and adapter clipped with Cutadapt version 1.2.1 (Martin, 2011). For splice site discovery we mapped all available reads. In order to allow for a direct comparison of the relative abundance of circular and trans-spliced reads we down-sampled the data sets to approximately the same size. In this way, we avoid artifacts that are caused by the use of coverage thresholds for the detection of splice junctions. Otherwise, the number of detected junctions would increase in a poorly controlled manner with the size of the mapped library.

L. chalumnae genome annotation, described in Amemiya et al. (2013), was downloaded from ENSEMBL version 70.

Mapping and Splice Site Detection

We used segemehl version 0.1.4 (Hoffmann et al., 2009; Hoffmann et al., submitted) to map the reads onto the L. chalumnae genome allowing explicitly for split reads. Throughout this contribution we strictly distinguish between splice sites, defined as the genomic positions of a splice donor or splice acceptor, and a splice junction, defined as a pair of donor and acceptor positions spanned by an observed transcript. The splice sites reported by segemehl were then filtered by haarz, a component of the segemehl suite, in order to accumulate high confidence splice sites. To further reduce the chance of mapping artifacts, only splice junctions supported by at least three split reads were kept. Splice sites not included in one of these junctions were also removed from further analysis. The results of the filtering procedure are summarized in Supplementary Table S1.
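The read-support filter described above can be illustrated with a minimal sketch. It is not the haarz implementation; the function name `filter_junctions` and the representation of a splice site as a `(scaffold, position)` tuple are assumptions made for illustration only. Only the threshold of three split reads is taken from the text.

```python
from collections import Counter

MIN_SPLIT_READS = 3  # support threshold stated in the text


def filter_junctions(split_reads, min_support=MIN_SPLIT_READS):
    """Keep junctions supported by at least `min_support` split reads.

    `split_reads` is an iterable of (donor, acceptor) pairs, one per split
    read. Each site is a hypothetical (scaffold, position) tuple.
    Returns the kept junctions with their read counts, and the set of
    splice sites that take part in at least one kept junction.
    """
    support = Counter(split_reads)
    kept = {junction: n for junction, n in support.items() if n >= min_support}
    # Splice sites that are not part of any kept junction are dropped.
    sites = {site for junction in kept for site in junction}
    return kept, sites
```

In this sketch a splice site survives only through a surviving junction, which mirrors the distinction between splice sites and splice junctions drawn in the paragraph above.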

We determined three types of splice junctions: (1) “normal” junctions with read fragments mapped colinearly with the genomic DNA to the same strand of the same scaffold and an insert size between 15 nt and 50 kb; (2) “circular” junctions on the same strand of the same scaffold with a distance less than 50 kb and with fragment order inverted relative to the genomic DNA; (3) “trans-splicing” junctions, where the two splice sites are located on different scaffolds. The relative orientation is of course irrelevant in this case. Spliced reads that connect two scaffolds can arise from normal, colinear splice events if the scaffolds are short or the splice sites are close to the ends of the scaffolds. In order to avoid contamination from such effects arising from the incompleteness of the genome assembly, we classified reads as trans-spliced only if those reads connect loci at least 50 kb from both ends of at least one of the two involved scaffolds.
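The three junction classes can be sketched as a small decision function. This is a simplified illustration under stated assumptions, not the classification code used in the study: the names `SplitSide`, `far_from_ends`, and `classify` are hypothetical, each side of a split is reduced to a single coordinate, and colinearity is judged from the order of the two sides in the read relative to the mapped strand. The thresholds (15 nt, 50 kb) are those given in the text.

```python
from collections import namedtuple

MAX_LOCAL_DISTANCE = 50_000  # 50 kb limit for normal and circular junctions
MIN_INSERT = 15              # minimal insert size for normal junctions
MIN_END_DISTANCE = 50_000    # required distance from both scaffold ends

# One side of a split read: scaffold name, strand ("+" or "-"), coordinate.
SplitSide = namedtuple("SplitSide", ["scaffold", "strand", "position"])


def far_from_ends(side, scaffold_lengths):
    """True if the site lies at least 50 kb from both ends of its scaffold."""
    length = scaffold_lengths[side.scaffold]
    return (side.position >= MIN_END_DISTANCE
            and length - side.position >= MIN_END_DISTANCE)


def classify(first, second, scaffold_lengths):
    """Classify a split given its two sides in read order."""
    if first.scaffold != second.scaffold:
        # Trans: at least one site must be far from both scaffold ends.
        if (far_from_ends(first, scaffold_lengths)
                or far_from_ends(second, scaffold_lengths)):
            return "trans"
        return "unclassified"
    if first.strand != second.strand:
        return "unclassified"
    # Signed distance along the direction of the mapped strand.
    step = second.position - first.position
    if first.strand == "-":
        step = -step
    if MIN_INSERT <= step <= MAX_LOCAL_DISTANCE:
        return "normal"
    if step < 0 and -step < MAX_LOCAL_DISTANCE:
        return "circular"  # fragment order inverted relative to the genome
    return "unclassified"
```

Splits that fit none of the three definitions (for example, same-scaffold splits on opposite strands) are returned as "unclassified" here; how such cases were handled in the original analysis is not stated.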

Since a strand-unspecific RNA-seq protocol was used here, the reading direction of spliced reads can only be inferred indirectly. For reads splitting at canonical splice junctions we used MaxEntScan (Yeo and Burge, 2004) scores to compare the two putative reading directions. For both directions we computed the sum of the donor and acceptor score. If one direction had a positive sum that exceeded the sum of the opposite direction by more than 3, we defined this as the correct reading direction.
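The decision rule for the reading direction can be written down in a few lines. The sketch below assumes that the donor and acceptor scores have already been computed and summed for each candidate direction; it does not call MaxEntScan, and the function name `infer_direction` is hypothetical.

```python
MARGIN = 3.0  # score margin stated in the text


def infer_direction(forward_sum, reverse_sum, margin=MARGIN):
    """Return "+", "-", or None from the two donor+acceptor score sums.

    A direction is accepted only if its sum is positive and exceeds the
    sum of the opposite direction by more than `margin`.
    """
    if forward_sum > 0 and forward_sum > reverse_sum + margin:
        return "+"
    if reverse_sum > 0 and reverse_sum > forward_sum + margin:
        return "-"
    return None  # reading direction remains undetermined
```

For example, sums of 6.0 and 4.0 leave the direction undetermined, because the difference does not exceed the margin.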

Variation Calling

Variation within and between the two coelacanth species was quantified by calling SNPs in the mapped transcriptome data at all sites with a coverage of at least 8 reads. We used GATK version 2.3 (De Pristo et al., 2011) for SNP calling.

Circular Motif Search

In order to find a putative motif that is predominantly associated with circular junctions, we extracted 6 nt of DNA sequence at each splice site (3 nt in exon and intron) and combined them to form a 12 nt sequence pattern for each splice junction. This results in 5,561 unique patterns for 5,760 circular splice junctions and 27,311 unique patterns for 213,417 normal splice junctions. We employed MEME (Bailey and Elkan, 1995) for the motif search. It was run with the “zero or one match per sequence” option. As expected, the canonical splice junction motif was readily recovered. After removing about 700 twelve-mers that conform to a canonical or minor spliceosome motif, we again started the MEME motif search to see if any additional characteristic patterns could be detected. This was not the case.
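The construction of the 12 nt pattern can be sketched as follows. The coordinate convention is an assumption made for illustration: `donor` is the 0-based index of the first intronic base after the upstream exon, `acceptor` is the 0-based index of the first exonic base after the intron, and the sequence is taken on the strand of the transcript. The function name `junction_pattern` is hypothetical.

```python
def junction_pattern(sequence, donor, acceptor, flank=3):
    """Concatenate 3 nt exon + 3 nt intron at the donor with
    3 nt intron + 3 nt exon at the acceptor into a 12 nt pattern."""
    donor_part = sequence[donor - flank:donor + flank]
    acceptor_part = sequence[acceptor - flank:acceptor + flank]
    return donor_part + acceptor_part


def unique_patterns(sequence, junctions):
    """Set of distinct 12 nt patterns for a list of (donor, acceptor) pairs."""
    return {junction_pattern(sequence, d, a) for d, a in junctions}


# Toy example: exon, intron with gt...ag boundaries, exon.
exon1, intron, exon2 = "AAACAG", "GTAAGTTTTTCAG", "GCCTTT"
toy = exon1 + intron + exon2
pattern = junction_pattern(toy, len(exon1), len(exon1) + len(intron))
# pattern is "CAGGTACAGGCC": exon|intron at the donor, intron|exon at the acceptor
```

Because the two halves are extracted independently, the same function applies to circular junctions, where the donor lies downstream of the acceptor.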

SHSs and RTfacts

We analyzed to what extent the splice junctions of the data sets can be explained by the short homologous sequences model proposed in Li et al. (2009) or by RT PCR artifacts that show similar sequence homology (cf. Houseley and Tollervey, 2010). We thus computed the maximal length of the homologous subsequences between the exonic regions of the donor and acceptor splice sites. An exact overlap of at least 4 nt was counted as “short homologous sequence” (SHS), which may indicate an RTfact.
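One way to make the SHS criterion concrete is sketched below. The text does not specify exactly how the exonic flanks were compared, so this example substitutes a plain longest-common-substring computation between the two flanking sequences; that choice, the function names, and the flank representation as strings are all assumptions. Only the 4 nt threshold comes from the text.

```python
MIN_SHS = 4  # minimal exact overlap counted as SHS in the text


def longest_common_substring(a, b):
    """Length of the longest exact substring shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for char_a in a:
        cur = [0] * (len(b) + 1)
        for j, char_b in enumerate(b, start=1):
            if char_a == char_b:
                # Extend the match that ended at the previous positions.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def is_shs(donor_exon_flank, acceptor_exon_flank, min_len=MIN_SHS):
    """True if the two exonic flanks share an exact match of >= min_len nt."""
    shared = longest_common_substring(donor_exon_flank, acceptor_exon_flank)
    return shared >= min_len
```

The dynamic program runs in time proportional to the product of the flank lengths, which is negligible for flanks of a few tens of nucleotides.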

Coverage Estimation for Splice Junctions

In order to investigate the relationships between RNA expression and abundance of spliced RNA reads, we defined “coverage loci” as follows: We considered genomic regions with a minimum coverage of 8 reads and merged regions separated by less than 100 nt. Regions shorter than 50 nt were removed from further analysis. To account for inaccuracies in determining the boundaries of loci, we counted all spliced reads with a splice junction within 50 nt of a “coverage locus.”
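The definition of coverage loci can be sketched as an interval computation on a per-base coverage vector. The order of operations (merge first, then discard short regions) is an assumption, since the text does not state it explicitly; the function name `coverage_loci` and the half-open `(start, end)` interval convention are likewise illustrative.

```python
def coverage_loci(coverage, min_cov=8, max_gap=100, min_len=50):
    """Return half-open (start, end) loci for one scaffold.

    `coverage` is a list with the read depth at each position.
    """
    # 1. Collect maximal runs of positions with depth >= min_cov.
    runs = []
    start = None
    for i, depth in enumerate(coverage):
        if depth >= min_cov:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(coverage)))

    # 2. Merge runs separated by fewer than max_gap uncovered bases.
    merged = []
    for run_start, run_end in runs:
        if merged and run_start - merged[-1][1] < max_gap:
            merged[-1] = (merged[-1][0], run_end)
        else:
            merged.append((run_start, run_end))

    # 3. Discard loci shorter than min_len.
    return [(s, e) for s, e in merged if e - s >= min_len]
```

With this ordering, two short covered stretches that lie close together can jointly form a locus that passes the length filter.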

Transcriptome Reconstruction and Identification of Novel lincRNAs

We used cufflinks version 2.0.2 (Roberts et al., 2011) to reconstruct possible transcripts together with their isoforms. The mapping output of segemehl was modified to fit the input requirements of cufflinks using a custom script. Separate transcript assemblies for both the complete L. chalumnae data set and the combined L. menadoensis data sets were merged together with cuffmerge as proposed by Trapnell et al. (2012). Overlaps between transcript and annotation data were computed with the help of BEDTools (Quinlan and Hall, 2010). In order to predict the coding potential of transcripts that were located at unannotated regions we applied RNAcode (Washietl et al., 2011) to the coelacanth-centric multiple alignment described in Amemiya et al. (2013). Transcripts were classified as potentially coding if at least half of their exons showed a minimum overlap of 50% with potentially coding regions. Transcripts that did not overlap with potentially coding regions were classified as potentially new lincRNAs. To confirm these lincRNA candidates they were compared against the non-redundant protein data base version (March 03, 2012) with tblastx (Altschul et al., 1997). Candidates that showed significant alignment hits were added to the potentially coding class. We operationally combined transcripts with the same reading direction separated by less than 5 kb into a single locus to account for the fact that many lincRNAs have rather low expression levels and thus may not be fully covered.
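The two rules at the end of this procedure, the exon-overlap criterion for coding potential and the 5 kb merging of transcripts into loci, can be sketched as follows. This is an illustration under stated assumptions and not the BEDTools-based workflow itself: intervals are half-open `(start, end)` tuples on a single scaffold, the coding regions are assumed not to overlap each other, and the function names are hypothetical.

```python
def overlap_fraction(exon, regions):
    """Fraction of the exon covered by (non-overlapping) regions."""
    start, end = exon
    covered = sum(max(0, min(end, r_end) - max(start, r_start))
                  for r_start, r_end in regions)
    return covered / (end - start)


def classify_transcript(exons, coding_regions):
    """Apply the 'at least half of the exons overlap by >= 50%' rule."""
    if not exons:
        raise ValueError("transcript without exons")
    fractions = [overlap_fraction(exon, coding_regions) for exon in exons]
    if 2 * sum(f >= 0.5 for f in fractions) >= len(exons):
        return "potentially coding"
    if all(f == 0 for f in fractions):
        return "lincRNA candidate"
    return "unclassified"  # partial overlap below the coding threshold


def merge_into_loci(transcripts, max_gap=5_000):
    """Merge same-strand transcripts separated by less than max_gap.

    `transcripts` is a list of (start, end, strand) tuples.
    """
    loci = []
    for strand in ("+", "-"):
        current = None
        for start, end, _ in sorted(t for t in transcripts if t[2] == strand):
            if current is not None and start - current[1] < max_gap:
                current = (current[0], max(current[1], end), strand)
            else:
                if current is not None:
                    loci.append(current)
                current = (start, end, strand)
        if current is not None:
            loci.append(current)
    return loci
```

Transcripts with partial overlap that neither meet the coding rule nor are free of coding overlap are labeled "unclassified" in this sketch; the text does not say how such intermediate cases were treated.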

Conservation of Splice Sites

In order to obtain evidence for the conservation of gene structure we used the coelacanth-centered multiple sequence alignment (Amemiya et al., 2013) to find the exact sequence positions homologous to the splice sites in Latimeria. We then determined whether an experimentally known splice site is annotated in RefSeq or an EST data set at this position for any of the other eight aligned species (human, mouse, dog, opossum, lizard, stickleback, frog, or chicken). In the absence of experimental evidence we used MaxEntScan (Yeo and Burge, 2004). Following Nitsche et al. (in preparation), we used a score cutoff >3 for predicted splice sites.

Validation Experiments

The first-strand cDNA and genomic DNA from muscle of L. menadoensis were amplified by thermal cycling using the Takara ExTaq PCR kit (Takara, Japan). Primer pairs were designed to generate a PCR product that spanned the fusion site for these transcripts. Additional control primers were designed to amplify sequences present only in the local genomic contexts. Amplification was performed for 30 cycles at 94°C for 15 sec, 55°C for 30 sec, and 72°C for 1 min, with a final elongation for 8 min at 72°C. The amplified PCR products were sub-cloned into the pCRII-TOPO dual promoter vector (Invitrogen, Grand Island, NY, USA) and sequenced. Primer sequences and PCR products are listed in the Online Supplementary Tables S7 and S8.

RESULTS

Mapping and Variation

Between 75% and 80% of the reads in the individual RNA-seq data sets could be mapped to the reference genome. Between 1/6 and 1/5 of these mapped with splits, see Supplementary Table S2. In addition, sub-sampled libraries were mapped to obtain comparable sample sizes for quantifying circular and trans-spliced reads, see Supplementary Table S2.

The RNA-seq libraries of L. chalumnae muscle tissue and L. menadoensis liver and testis were of comparable size and quality, covering slightly more than 1% of the genome assembly. Using these transcriptome data as a reference, the two Latimeria species were very similar. The L. menadoensis transcripts showed only about 0.3% divergence from the L. chalumnae reference genome, while the number of heterozygous SNPs, that is, the intra-specific variation in L. menadoensis, was about twice as large. The number of homozygous differences between transcriptome and reference genome barely exceeded 0.1% and was consistent with about 0.4% heterozygous SNPs in L. chalumnae RNA-seq data (Table 1). The small divergence relative to the intra-specific diversity justified a joint analysis of all coelacanth transcriptome data in the following.

Table 1. Variation analysis of transcript libraries

Library      Coverage (nt)   Coverage (%)   Het.    Hom.
Lm liver     30,277,389      1.06           6,247   2,944
Lm testis    38,881,674      1.36           7,475   2,813
Lc muscle    37,702,883      1.32           4,400   1,378

Note: The variation columns give the number of heterozygous SNPs (Het.) and homozygous SNPs (Hom.) per 10^6 nt of genomic DNA estimated by GATK (De Pristo et al., 2011).

Splice Junctions and Transcripts

We made use of the enhanced sensitivity of segemehl in mapping split reads to extend the ENSEMBL 70.1 gene build for the LatCha1 assembly. The extreme similarity between the two coelacanth species, comparable to that between human and chimp, justified combining the RNA-seq data for the purpose of constructing transcript models.

For the L. chalumnae (muscle complete) RNA-seq data 26,176,970 reads were mapped with local, colinear splits. For the L. menadoensis (testis and liver) data 14,201,048 reads were mapped. For the union of these sets 12,817,375 normal split reads that satisfied our filtering criteria were retained. Although the RNA-seq data had been produced with a non-strand-specific protocol, the reading direction could be determined with the help of MaxEntScan (Yeo and Burge, 2004) for 98.8% of these reads based on the canonical splice site motifs. This resulted in 270,957 unique splice sites, of which 208,956 exactly matched the splice sites of the ENSEMBL 70.1 gene build for the LatCha1 assembly (Fig. 1, Top). About 43% of the ENSEMBL splice junctions were not visible in our transcriptome map because the corresponding genes were not expressed at sufficient levels to pass our filtering criteria in the three tissues considered here. Additionally, 1,793 sites matched to splice junctions from the lincRNA set reported in the Supplementary Materials (Supplementary Data 1) of the coelacanth genome paper (Amemiya et al., 2013). Another 17,801 mapped to novel splice junctions within the boundaries of genes annotated in ENSEMBL 70.1 in the correct reading direction. Since they did not match exactly to positions of annotated splice sites of ENSEMBL 70.1, they are not shown in Figure 1 (Top). This left 42,463 novel splice sites located outside annotated genes, corresponding to 22,424 distinct splice junctions that are located entirely outside of annotation. Furthermore, we identified 3,360 distinct junctions with only one side outside the published annotation. A detailed comparison of observed splice junctions is compiled in Supplementary Tables S3 and S4; a graphical summary of the splice sites accounting for the exact matches only is given in Figure 1 (Top).

Figure 1. Overview of splice sites and “loci” in comparison to the existing annotation. (Top) Venn diagram of unique single splice site positions, detected in our colinear mapped split reads (“normal splice sites”), annotated by ENSEMBL and reported in lincRNAs identified in the main paper (Amemiya et al., 2013). (Bottom) Venn diagram comparing ENSEMBL gene annotation with expressed loci from our mapping data. Transcripts with a distance less than 5 kb to each other were merged into one locus, resulting in 69,579 loci. The intersection shows the number of loci that overlap gene boundaries annotated by ENSEMBL. The distinction of these loci into coding and non-coding is determined by the biotype of the respective overlapping genes.

Assembled into transcripts with cufflinks and cuffmerge, these combined transcriptome data of L. chalumnae and L. menadoensis encompassed 126,235 distinct transcripts belonging to 109,761 genes. This amounts to an average of 2.54 exons per transcript. Of these, 86,203 (68.3%) transcripts were intronless. 61.9% of the transcripts (69,434) did not contain exons located within gene boundaries annotated by ENSEMBL. The majority of these, namely 58,058 transcripts, were intronless.

About 87% (60,444) of these new transcripts can be considered as lincRNAs since they have no overlaps with RNAcode hits or blastx hits in the CCDS data base with an E-value <10^-10. About 18% of the rest, that is, 1,586 new transcripts, can be classified as potentially coding genes, since at least half of their exons overlap by at least 50% of their sequence with blastx alignments or with regions found by RNAcode. If strand information was available, the overlap had to be strand-specific.

We found 22,424 novel splice junctions outside the published annotation corresponding to 41,139 unique splice sites. Of these, 32,467 matched exactly with splice sites in the collated transcript models produced by cuffmerge. 4,163 additional splice sites were located within these transcripts, apparently corresponding to local variations in the exact splicing position.

It should be noted that a substantial fraction of splice sites from the raw data were not incorporated into transcript models by cufflinks. This explains, for example, why part of the splice sites in the lincRNAs annotated in Amemiya et al. (2013) are not recovered in our analysis.

An overview of the transcriptome analysis relative to the previously available annotation is given in Figure 1 (Bottom), where transcripts were merged into loci according to a 5 kb window. Overall, we report here 50,644 novel expressed loci that were overlooked in previous analyses of the same data sets. Of these, 30,268 contain spliced transcripts. The vast majority of newly identified transcripts is non-coding. Nevertheless, we were able to identify more than 500 additional loci with coding capacity.

Conservation of Transcript Structure

Of the 270,957 canonical splice sites in the combined data set, which includes 208,956 sites matching to ENSEMBL annotation (Supplementary Table S3), about 77.8% were alignable in at least one of eight other vertebrate genomes. More than 96% of these were conserved according to splice site scores, and for 92.7% there was experimental evidence for a functional splice site in at least one of these eight species (Supplementary Table S5). The overwhelming majority of these splice sites were located within protein-coding genes.

We observed 23,065 splice sites in 8,066 spliced lincRNAs in the union of our lincRNAs and the lincRNAs reported in the coelacanth genome paper (Amemiya et al., 2013). About 14% of the splice sites (1,839 sites in 1,135 transcripts) in this combined lincRNA set were alignable to sequence in at least one of the other eight vertebrate genomes included in the Latimeria-centered MSA (Table 2). Of these, 40% correspond exactly to an annotated splice site in at least one of these species, providing direct evidence for the partial conservation of 301 lincRNA loci (merged from 391 transcripts).

Table 2. Conservation of splice sites of coelacanth lincRNAs

                      Splice sites         Transcripts
Species               Align.    Annot.     Align.    Annot.
Coelacanth (total)    23,065               8,066
Human                 447       254        310       146
Mouse                 350       195        253       117
H or M                540       292        374       166
Frog                  823       334        514       190
Stickleback           315       79         229       52
8 Species             1,839     733        1,135     394

Note: “H or M,” human or mouse; “8 Species” means that the splice site is present in at least one of the eight other vertebrates in the Latimeria-centered 9-way multiple sequence alignment.

The rather poor conservation of lincRNAs as measured by splice sites does not come as a surprise, since only a small fraction of the observed splice junctions were included in the multiple sequence alignments in the first place. Their level of sequence conservation was very low compared to other functional transcripts (Pang et al., 2006; Marques and Ponting, 2009), although there is good evidence that, at least as a group, mRNA-like non-coding RNAs are under stabilizing selection (Ponjavic et al., 2007; Guttman et al., 2009; Marques and Ponting, 2009; Young et al., 2012).

Circularized Transcripts

For L. menadoensis and L. chalumnae we observed 5,760 circularizing junctions and 17,066 trans-splicing junctions. For a fraction of 10.6% and 28.7%, respectively, we were able to determine a reading direction, based on canonical splice motifs. Thus 610 circular junctions remain, consisting of 1,120 canonical splice sites. Almost half of these splice sites (501) are also utilized in regular, colinear splice junctions. They are surprisingly well conserved: more than 60% are located in a region that is alignable in at least one other distant vertebrate and more than a third of these positions constitute a functional splice site according to the available experimental evidence, see Table 3. A comparison of circularizing splice junctions with recent reports of circular microRNA sponges in the human transcriptome (Hansen et al., 2013; Memczak et al., 2013) did not provide evidence for the conservation of these particular RNAs between mammals and coelacanth, however.

Table 3. Conservation of circular splice sites

No. of unique splice sites
Species               Align.   Annot.   Pred.   Cons.
Coelacanth (total)    619
Human                 282      117      103     132
Mouse                 273      102      96      116
H or M                296      126      115     147
Frog                  291      81       102     109
Stickleback           263      39       81      89
8 Species             375      147      173     202

Note: Since 501 of the 1,120 circular splice sites are also involved in normal splice events, we only used the remaining 619 for the conservation statistic. The first column shows how many coelacanth splice sites could be “aligned” to the relevant species. The second column gives the number of splice sites that are also annotated as splice sites in this species. The abbreviation “pred.” refers to “predicted” splice sites, with a MaxEntScan score >3 in the aligned sequence. The last column summarizes the “conserved” splice sites, as the union of the “annotated” and “predicted” ones.

Trans-Spliced Transcripts

In the combined Latimeria RNA-seq data we found 17,066 trans-splice junctions connecting different scaffolds. Among these are 338 that are backed by more than 100 split reads. The majority of these splice sites were unique to trans-splicing events.

Table 4 summarizes the conservation of the trans-splicing sites. Only a third of them could be aligned to homologous sequences in other vertebrates. In most of these cases we observed a functional splice site in the other species. However, in general, the specificity for non-local junctions does not appear to be as conserved across species as other splice sites.

Table 4. Conservation of trans-splice sites

No. of unique splice sites
Species               Align.   Annot.   Pred.   Cons.
Coelacanth (total)    6,370
Human                 1,887    1,616    1,607   1,653
Mouse                 1,815    1,540    1,534   1,583
H or M                2,023    1,738    1,746   1,781
Frog                  1,814    1,249    1,525   1,550
Stickleback           1,545    640      1,274   1,306
8 Species             2,483    1,964    2,128   2,150

Note: Since 1,116 of the 7,486 trans-splice sites are also involved in normal splice events, we only used the remaining 6,370 for the conservation statistic. For a column description see Table 3.

While circular transcripts have been validated in several studies and we are beginning to understand their regulatory functions, it seems that the reality of large numbers of trans-splicing-like RNAs is still not universally accepted. We therefore selected three examples for direct validation by PCR, see Figure 2 for details.

Figure 2. Three atypical transcripts validated by PCR. The green arrow represents the start codon, while the stop codon is represented by a red line combined with a red circle. The scaffolds are represented as thick blue lines and the strand sense can be read from the orientation of the blue arrow-head. Gene extent is represented as a light gray rectangle, while exon and CDS are represented by dark gray and wavy rectangles, respectively. (A) Distant splice event connecting gene ENSLACG00000011127 and ENSLACG00000016833. (B) Trans-splice events between gene AMMECR1 on scaffold JH26651.1 and ACADM on scaffold JH27663.1. (C) Fusion connecting the fifth exon of ENSLACG00000018504, which is a putative orthologue of human PYGB (phosphorylase), and an intronic region of ENSLACG00000016231, an orthologue of RGS3 (regulator of G-protein signaling).

Comparison of Normal and Atypical Transcripts

For a better comparison of the properties of circular and trans-spliced transcripts in the individual data sets we used sub-samples of equal size. In this way we obtained a comparable sequencing depth, which should at least alleviate the biases arising from very rare junctions in the largest data sets. While this simple normalization cannot account for differences in the expression profiles of the different tissues it should at least make the data sets qualitatively comparable.

Results for the two coelacanth species are very similar, as shown in Supplementary Figure S1; hence we use their union. Figure 3 compares the atypical reads with normal (local and colinear) splice events for coelacanth, human, and zebrafish RNA libraries. As expected, the overwhelming majority of normal splice events utilize canonical splice patterns. In contrast, circular and trans-splice (long-range) events often use alternative sequence patterns, although a substantial fraction still conforms to the canonical motifs. We observed that in the coelacanth data, more circularizing splice junctions are off by 1 or 2 nt compared to both the human and the zebrafish data set. Adding these to the canonical subset (distance 0 in the left hand side panels of Fig. 3) yielded nearly the same fraction of about 70–80% canonical splice motifs as zebrafish and human. We note that this fraction is substantially larger than the numbers reported in Li et al. (2009). Surprisingly, however, the right-most two columns of Figure 3 show that most of the circularizing and trans-joining splice junctions are disjoint from normal splice junctions. This effect is even more pronounced in coelacanth and zebrafish than in human. This pattern, which we observe for both the circularizing and the trans-junctions, strongly suggests that the resulting unconventional transcripts are not merely a by-product of conventional, local splicing events.

Figure 3. Comparison of splice junctions in normal (local and colinear), circular, and trans-spliced reads. The first three columns of the panel summarize the distribution of distances of predicted splice junctions from the closest canonical donor or acceptor motif (gt-ag). The remaining two columns show the distances to the nearest site that also harbours normally spliced reads. As expected, nearly all normal splice events use canonical splice motifs in human and coelacanths alike. In contrast, circular and trans-spliced reads are often far away from and unrelated to normal splice events.

Since a substantial fraction of the circular and trans-splice junctions did not fit the canonical splice site motif, we searched for additional over-represented patterns in the remaining junctions. No significant pattern could be identified, however. We then searched for the “short homologous sequences,” that is, short sequences with four or more nucleotides, shared by the sequences surrounding the “splice junction.” According to Houseley and Tollervey (2010), however, these might be RTfacts. Supplementary Table S6 shows that such patterns are rare in our data, ranging from 0.7% to 2.6% of the circularized or trans-spliced transcripts. At the same time, the majority of atypical junctions are associated with canonical splice site motifs. We thus conclude that contamination levels in our data are low and the majority of both circular and trans-spliced RNAs cannot be explained as technical artifacts.

In Figure 4, we summarize the correlation between the coverage of a given locus and the abundance of splice events. For the normal splice events, we clearly observe two populations. Along the x-axis, we record a large number of rare splice junctions whose occurrence shows no correlation with the expression level. Along the main diagonal, on the other hand, efficiently spliced transcripts are recorded. Here, the number of spliced reads essentially equals the coverage of the locus. For circular and trans-spliced loci, we also observe a separation into two populations. The number of circular and trans-spliced reads, however, is typically only a fraction of the coverage of the locus. Most circular and trans-spliced transcripts thus originate from loci that also produce more conventional isoforms. Interestingly, measured in terms of RNA levels, “aberrant” isoforms are much more abundant in coelacanths than in zebrafish and human.

Figure 4. Correlation of coverage and splice site utilization. Two populations are clearly visible.
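
The two populations can be made concrete by relating the number of split reads at a junction to the coverage of its locus. The sketch below is only an illustration: the cutoff `diagonal_cutoff` and the labels are hypothetical and are not thresholds used in the analysis.

```python
from typing import Optional


def splice_fraction(split_reads: int, coverage: float) -> Optional[float]:
    """Fraction of the locus coverage that is accounted for by split reads."""
    if coverage <= 0:
        return None
    return split_reads / coverage


def classify_junction(split_reads: int, coverage: float,
                      diagonal_cutoff: float = 0.5) -> str:
    """Assign a junction to one of the two populations.

    "efficient": split reads roughly track coverage (near the main diagonal).
    "rare": split reads are a small fraction of coverage (near the x-axis).
    The cutoff of 0.5 is an arbitrary, hypothetical choice.
    """
    frac = splice_fraction(split_reads, coverage)
    if frac is None:
        return "no coverage"
    return "efficient" if frac >= diagonal_cutoff else "rare"


labels = [classify_junction(95, 100), classify_junction(3, 100)]
```

Under this reading, most circular and trans-spliced junctions would fall below the diagonal, because their read counts are only a fraction of the coverage of the locus.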

DISCUSSION

Atypical transcripts, defined here as those that do not map locally and colinearly to the genome, have only very recently been recognized as a major component of the transcriptome. We analyze here in detail the available RNA-seq data of two coelacanth species, L. chalumnae and L. menadoensis. With the help of an improved mapping algorithm implemented in segemehl (Hoffmann et al., submitted) that deals efficiently with both typical and atypical transcripts, we give a detailed overview of the coelacanth transcriptome. In particular, we report 51,488 additional expressed loci from which normal transcripts arise (576 protein-coding and 37,099 lincRNAs), together with 362 splice sites of circular RNAs and 4,698 of long-range (trans-spliced) connections. The very high fraction of junctions that use canonical splice sites strongly suggests that the overwhelming majority of these transcripts cannot be explained as “RTfacts” and must be interpreted as biological reality.

A particularly interesting feature is the unexpectedly high level of evolutionary conservation of splice sites involved in circularization. The majority of these sites do not appear in normal transcripts. Their conservation suggests that they are of functional importance. Recent reports of abundant, stable, and often conserved circular RNAs in mammals have identified them as an important class of regulatory molecules (Hansen et al., 2013; Jeck et al., 2013; Memczak et al., 2013). Our results strongly indicate that such “atypical” transcripts are evolutionarily old, dating back at least to an osteichthyan ancestor. Non-local trans-spliced transcripts are even less well understood. The statistical similarities in splice site usage and conservation between trans-spliced and circularized products suggest that at least a subset is also functional. This observation is further strengthened by the evolutionary conservation of at least a fraction of the non-local trans-splice sites, albeit a smaller fraction than for the circularizing sites. Future exploration of the functional significance of such “atypical” transcripts promises to yield many new insights.

ACKNOWLEDGMENTS

LIFE—Leipzig Research Center for Civilization Diseases, Universität Leipzig is funded by means of the European Social Fund and the Free State of Saxony.

LITERATURE CITED

Supporting Information

Additional supporting information may be found in the online version of this article at the publisher's web-site.

File: jezb22542-sm-0001-SupData-S1.pdf (size: 43K), containing the following items:

Figure S1. Comparison of splice junctions in normal (local and colinear), circular, and trans-spliced reads. The first three columns of the panel summarize the distribution of distances of predicted splice junctions from the closest site with a canonical donor or acceptor motif (gt-ag). The remaining two columns show the distances to the nearest site that also harbours normally spliced reads.

Table S1. Split numbers.

Table S2. Summary of mapped reads.

Table S3. Comparison of observed splice junctions in L. chalumnae and L. menadoensis, resulting from reads mapped with local, colinear splits. The label “filtered” denotes filtering by the haarz mapping criteria and a minimum junction support of three reads.

Table S4. Relation of splice junctions to annotation.

Table S5. Conservation of normal splice sites.

Table S6. SHS motifs in circular and trans-spliced RNAs are indicative of “RTfacts.”

Table S7. Primers used to validate putative fused transcripts.

Table S8. Sequences for subcloned PCR products from validation experiment.