One of the most surprising results to emerge from mammalian cDNA sequencing projects is that thousands of mRNA-like non-coding RNAs (ncRNAs) are expressed and constitute at least 10% of poly(A)+ RNAs. In most cases, however, the functions of these RNA molecules remain unclear. To clarify the biological significance of mRNA-like ncRNAs, we computationally screened 11 691 Drosophila melanogaster full-length cDNAs. After eliminating presumable protein-coding transcripts, 136 were identified as strong candidates for mRNA-like ncRNAs. Although most of these putative ncRNAs are found throughout the Drosophila genus, predicted amino acid sequences are not conserved even in related species, suggesting that these transcripts are actually non-coding RNAs. In situ hybridization analyses revealed that 35 of the transcripts are expressed during embryogenesis, of which 27 were detected only in specific tissues including the tracheal system, midgut primordial cells, visceral mesoderm, germ cells and the central and peripheral nervous system. These highly regulated expression patterns suggest that many mRNA-like ncRNAs play important roles in multiple steps of organogenesis and cell differentiation in Drosophila. This is the first report that the majority of mRNA-like ncRNAs in a model organism are expressed in specific tissues and cell types.
It is becoming increasingly clear that an enormous number of genes in both prokaryotes and eukaryotes encode RNA, rather than polypeptide, as gene products. Non-coding RNA (ncRNA) is a general term for such functional and untranslatable RNAs, and encompasses bacterial riboregulators, signal recognition particle (SRP) RNAs, spliceosomal small nuclear RNAs (snRNAs), small nucleolar RNAs (snoRNAs) and micro RNAs (miRNAs), as well as classical ribosomal and transfer RNAs (Eddy 2001; Griffiths-Jones et al. 2003; Szymanski et al. 2003; Pang et al. 2005). Studies on these ncRNAs have shown that they play important roles in a variety of biological events such as protein transport, transcriptional and translational gene regulation, and RNA processing.
In addition to these conserved and relatively low-molecular-weight ncRNAs, recent progress in genomics has revealed that eukaryotic cells also contain numerous ncRNAs that are totally uncharacterized. Indeed, one of the most striking findings from large-scale cDNA sequencing projects in mammals is the unexpected abundance of mRNA-like non-coding RNAs (Okazaki et al. 2002; Numata et al. 2003; Ota et al. 2004), which have similar molecular profiles to protein-coding mRNAs and are transcribed by RNA polymerase II, polyadenylated and often spliced (Erdmann et al. 1999). At least 13% and 26% of the unique full-length cDNAs in mice and humans, respectively, are thought to be poly(A) tail-containing mRNA-like ncRNAs (Okazaki et al. 2002; Numata et al. 2003; Ota et al. 2004). The abundance of mRNA-like ncRNAs in these organisms strongly suggests that they have important roles in a wide range of biological phenomena; analysis of mRNA-like ncRNA is thus a critical requirement for comprehensive understanding of genome function.
Several mRNA-like ncRNAs are known to play essential roles in vivo. For example, Pgc RNA in Drosophila is specifically expressed in primordial germ cells, and is required for maintenance of germ cell fate (Nakamura et al. 1996). Pgc regulates transcriptional repression during early embryonic development, probably by modifying the C-terminal domain of RNA polymerase II (Martinho et al. 2004). The bereft RNA is also expressed tissue-specifically, in the Drosophila peripheral nervous system (Hardiman et al. 2002). The bereft mutant flies exhibit aberrant development of external sensory organs and loss or malformation of the interommatidial bristles of the eye. Another intensively studied example is the Xist RNA in mice and humans, which is specifically localized onto the inactive X chromosome (Xi) in females and represses transcription from Xi-linked genes, resulting in dosage compensation (Avner & Heard 2001; Plath et al. 2002). Drosophila roX1 and roX2 are also functional RNAs that are essential for dosage compensation, although these RNAs hyperactivate transcription from the X chromosome in males (Meller & Rattner 2002; Kelley 2004). Nevertheless, most mRNA-like ncRNA molecules, including those identified in cDNA sequencing projects, have not been characterized and their functions remain unknown.
What is the biological significance of mRNA-like ncRNAs? Are they also abundant in non-mammalian organisms? To answer these questions, we have carried out genome-wide in silico screening for mRNA-like ncRNAs in Drosophila and analyzed their expression by in situ hybridization. We identified 136 unannotated cDNAs as ncRNA candidates, having neither long ORFs (≥100 amino acids) nor similarity with protein-coding genes. Comparative genomic analysis showed that these RNA genes have evolved more rapidly than protein-coding sequences. About one quarter of these transcripts were detected during embryogenesis, and most of those were expressed in a tissue-specific manner. These results suggest that mRNA-like ncRNAs play important roles in organogenesis and cell differentiation during Drosophila development.
In silico screening for non-protein-coding transcripts in Drosophila
To elucidate the biological significance and functions of mRNA-like ncRNAs, we chose D. melanogaster as a model system and carried out a computational analysis of cDNAs to identify mRNA-like ncRNA candidates. As starting material, we used a public collection of 11 691 Drosophila full-length cDNAs, available from the Berkeley Drosophila Genome Project (Rubin et al. 2000). Each exon of these cDNAs was aligned with the Drosophila genome sequence by the SIM4 program (Florea et al. 1998), and any transcripts derived from computationally and experimentally defined protein-coding genes or transposable elements were eliminated based on the Drosophila genome annotations Release 4 (see Experimental procedures). We also checked whether the remaining cDNAs had any similarities with known protein-coding sequences by the BLASTX program. As a result, we obtained 241 transcripts as primary candidates for mRNA-like ncRNAs (Fig. 1A; Table S1).
To assess the protein-coding probability of these transcripts, we first examined their length and putative ORFs. The mean length of the 241 candidate cDNAs is 1.9 kb, similar to that of the 10 534 protein-coding transcripts, 2.0 kb. On the other hand, the mean maximum ORF length of the ncRNA candidates is 104 amino acids, much shorter than that of the protein-coding transcripts, 459 amino acids, suggesting that our screening was effective in extracting transcripts that do not have long ORFs. As a control, we randomized the sequence of each candidate and determined the maximum ORF length (Table S2). These control non-protein-coding sequences and our candidate transcripts have generally similar profiles, but differ in two notable respects (Fig. 1B). First, the non-coding candidates include a number of transcripts containing an ORF of 200 amino acids or longer. We repeated the randomization analysis three times, using independently generated control sequences, but never found an ORF of more than 160 amino acids (Fig. 1B; Table S2). Second, the mean maximum ORF length of the control sequences (59.4 amino acids) is markedly shorter than that of the non-coding candidates (104.1 amino acids), and most (>90%) of the ORFs in the computer-generated sequences are shorter than 100 amino acids. These results suggest that some of the non-coding candidates have protein-coding potential. We therefore excluded from the candidates 82 cDNAs containing an ORF of 100 amino acids or longer, leaving 159 transcripts.
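The ORF-length criterion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used in the screen: an ORF is taken here to run from an ATG to the first in-frame stop codon, only the three forward frames of the cDNA are scanned, and ORFs without a stop codon are ignored. The function names are hypothetical.

```python
# Assumption: an ORF runs from an ATG to the first in-frame stop codon.
# The exact ORF definition used in the screen is not stated in the text.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_in_frame(seq: str, frame: int) -> int:
    """Longest ATG..stop ORF in one frame, in amino acids (stop excluded)."""
    best = 0
    start = None  # index of the ATG that opened the current ORF, if any
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None:
            if codon == "ATG":
                start = i
        elif codon in STOP_CODONS:
            best = max(best, (i - start) // 3)
            start = None
    return best


def max_orf_length(seq: str) -> int:
    """Maximum ORF length over the three forward reading frames."""
    seq = seq.upper()
    return max(longest_orf_in_frame(seq, frame) for frame in range(3))


def passes_orf_filter(seq: str, cutoff: int = 100) -> bool:
    """True if the longest ORF is shorter than the cutoff (100 aa in the text)."""
    return max_orf_length(seq) < cutoff
```

With this definition, a transcript whose longest ORF is exactly 100 amino acids is excluded, matching the "100 amino acids or longer" criterion.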
Although “full-length” cDNA in Drosophila was defined as the longest clone of each EST cluster, assembled from a collection of more than 250 000 Drosophila ESTs (Stapleton et al. 2002a,b), it is still possible that each cDNA does not contain a whole transcriptional unit. If a cDNA were truncated and contained, for example, only a 5′- or 3′-UTR, it would be judged ncRNA according to our criteria. To assess this possibility, we next examined gene proximities of the ncRNA candidates. Gene proximity of the primary ncRNA candidates was evaluated as the distance from each cDNA to adjacent genes, calculated from positional information and genome annotations. The overall distribution profile of mRNA-like ncRNA candidates was similar to that of protein-coding genes (Fig. 1C, upper panels), indicating that only a limited number of truncated cDNA artifacts were contaminating our candidates. However, the ncRNA candidates include a small but clear peak very close to flanking genes (Fig. 1C, lower panels), suggesting that those located within 200 bp of an adjacent gene might include false-positive cDNAs. After eliminating these 23 cDNAs, we retained 136 transcripts as strong candidates for putative mRNA-like ncRNAs, including the well-studied ncRNAs Pgc and roX1, and 12 previously annotated CR (computational RNA) genes (Fig. 1A; Table S3).
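The gene-proximity check can be illustrated with a minimal Python sketch. It assumes that each cDNA and each annotated gene is reduced to a (start, end) interval on the same chromosome arm, that strand is ignored, and that "within 200 bp" means a gap of at most 200 bases; none of these details is specified in the text, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple

# (start, end), 1-based inclusive coordinates on a single chromosome arm.
Interval = Tuple[int, int]


def gap_between(a: Interval, b: Interval) -> int:
    """Number of bases separating two intervals; 0 if they touch or overlap."""
    return max(0, max(a[0], b[0]) - min(a[1], b[1]) - 1)


def distance_to_nearest_gene(cdna: Interval, genes: List[Interval]) -> int:
    """Smallest gap between a cDNA and any annotated gene."""
    return min(gap_between(cdna, gene) for gene in genes)


def flag_possible_utr_fragments(
    cdnas: Dict[str, Interval], genes: List[Interval], cutoff: int = 200
) -> List[str]:
    """Names of cDNAs lying within `cutoff` bp of an annotated gene."""
    return sorted(
        name
        for name, interval in cdnas.items()
        if distance_to_nearest_gene(interval, genes) <= cutoff
    )
```

A real pipeline would group intervals by chromosome arm before calling these functions.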
Nucleotide rather than predicted amino acid sequence of the putative Drosophila mRNA-like ncRNA is conserved
It is difficult to exclude the possibility that each ncRNA candidate actually encodes a small peptide. However, if the RNA itself is a functional molecule, its nucleotide sequence, rather than predicted amino acid sequence, should be evolutionarily conserved. To test this idea, we compared nucleotide and predicted amino acid sequences of the mRNA-like ncRNA candidates in D. melanogaster and D. pseudoobscura, which branched 25 million years ago (Mya) (Russo et al. 1995). By BLASTN analysis, 94 of the 136 ncRNA candidates were similar to nucleotide sequences in the D. pseudoobscura genome (E-value < 10⁻¹⁰), indicating that most of the RNAs are conserved in these two species. On the other hand, when predicted amino acid sequences from the longest ORF in each ncRNA were compared with those of the D. pseudoobscura genome, only three showed significant similarity (Table S4). Even in these three cases, similarities between the predicted amino acid sequences were much lower than those between the nucleotide sequences. These results lend further support to the idea that the identified mRNA-like ncRNA candidates do not encode proteins but function as RNA molecules.
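Counting conserved candidates from a BLASTN run amounts to asking, for each query, whether at least one hit falls below the E-value threshold. The Python sketch below assumes the results are available as tab-separated text in the standard 12-column BLAST tabular layout (query ID in the first column, E-value in the eleventh). The output format actually used in this study is not stated, and the function names are hypothetical.

```python
import csv
import io
from typing import Iterable, Set


def conserved_queries(tabular_text: str, max_evalue: float = 1e-10) -> Set[str]:
    """Query IDs with at least one hit whose E-value is below the threshold.

    Assumes 12-column tab-separated BLAST output:
    query, subject, %identity, length, mismatches, gap opens,
    q.start, q.end, s.start, s.end, E-value, bit score.
    """
    hits: Set[str] = set()
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query_id, evalue = row[0], float(row[10])
        if evalue < max_evalue:
            hits.add(query_id)
    return hits


def conserved_fraction(hits: Set[str], all_queries: Iterable[str]) -> float:
    """Fraction of all queries that have a significant hit."""
    queries = set(all_queries)
    return len(hits & queries) / len(queries)
```

As a check on the arithmetic, 94 conserved candidates out of 136 gives round(94 / 136 * 100, 1), which is 69.1%.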
mRNA-like ncRNA candidates have evolved more rapidly than protein-coding genes
In the above similarity search between D. melanogaster and D. pseudoobscura, we also found that ncRNA candidates are less highly conserved than protein-coding genes. Only 69.1% of D. melanogaster mRNA-like ncRNA candidates show significant similarity (BLASTN, E-value < 10⁻¹⁰) to sequences in the D. pseudoobscura genome, whereas by the same criterion 84.5% of the protein-coding transcripts have counterparts in this species (Table 1). The same trend was found in more distantly related drosophilid species: compared to approximately 70% conservation for the protein-coding transcripts, only 49.3 and 48.5% of the D. melanogaster ncRNA candidates are conserved in D. willistoni (36 Mya) and D. virilis (39 Mya), respectively. In addition to the Drosophila species, we analyzed another dipteran insect, the mosquito Anopheles gambiae, and found nucleotide similarities for 16.2% of the ncRNA candidates. Further comparisons with mouse and human genomes revealed that only three of the D. melanogaster ncRNA candidates showed similarities with mammalian sequences. Since these three genomic sequences in mammals encode known proteins, we assume that the similarities are fortuitous. These results suggest that non-coding transcripts have evolved more rapidly than protein-coding genes in multicellular organisms.
Table 1. mRNA-like ncRNAs are less conserved than protein-coding transcripts
Number of genes showing similarity with the D. melanogaster ncRNA candidates by BLASTN analysis (E-value < 10⁻¹⁰).
The majority of mRNA-like ncRNA candidates show tissue-specific expression patterns during Drosophila embryonic development
To assess the biological function(s) of the putative mRNA-like ncRNAs in vivo, we examined expression patterns during Drosophila development. From whole-mount in situ hybridization analyses of the 135 available cDNAs, we found that 33 transcripts, as well as Pgc and roX1, are expressed during Drosophila embryogenesis (Table 2). For convenience, we refer to these 33 embryonic RNAs as MREs (mRNA-like ncRNAs in embryogenesis) in this study, although these transcripts and their genes may have to be renamed after functional analysis in the future. Of the 33 MREs, only seven are expressed ubiquitously, while 26 were detected specifically in the tracheal system, digestive system, visceral mesoderm, germ cells and the central and peripheral nervous system (Fig. 2). Interestingly, the majority of the MREs were restricted to the nervous system (Table 2). Each nervous system-specific transcript, however, exhibited an apparently distinct expression pattern (Fig. 3; Fig. S1). For example, MRE-1 and -6 are specifically expressed along the midline cells of the ventral ganglia, which play an important role in commissure formation (Fig. 3A,B). On the other hand, MRE-24, -31 and -32 were found in a subset of neuroblasts (Fig. 3D–F) and MRE-12 was detected throughout the central nervous system (Fig. 3C). These expression patterns suggest that ncRNAs play important roles at multiple steps in Drosophila neurogenesis during embryonic development. Like Pgc, which is essential for proper germ cell development, several MREs are also expressed in primordial germ cells. During early embryogenesis, MRE-11, -21 and -30 are specifically expressed in pole cells at the posterior of the embryo (Fig. 3G–I). These three MREs are also expressed in other tissues (e.g. central nervous system or midgut primordial cells) in later developmental stages (Fig. 1), suggesting that these MREs are multifunctional and involved in the development of different cell types.
Table 2. mRNA-like ncRNAs expressed during Drosophila embryogenesis
CNS, central nervous system; PNS, peripheral nervous system.
a. newly identified transcripts
Hind gut, proventriculus
CNS, pole cells
Endoderm, pole cells
Mesoderm, tracheal pits
b. previously identified computational RNAs
CNS, pole cells
Endoderm, somatic mesoderm
c. functionally defined transcripts
In addition to varied spatial expression, several RNAs showed temporal variability during development. For example, MRE-1 and -24 are expressed in amnioserosa during early embryogenesis, although they are specifically expressed in the nervous system at later stages (Fig. 4A,B). Even within the central nervous system, the set of cells expressing MRE-32 changes during development. MRE-32 is expressed along the midline cells at stage 10–12, and expression in another subset of nerve cells appears at stage 12–14 (Fig. 4C). Maternal and ubiquitous distribution of MRE-2 is replaced by zygotic expression in the central nervous system during mid-embryogenesis (Fig. 4D). These results indicate that expression of the MREs is tightly regulated during Drosophila embryogenesis. Taken together, expression analyses of the MREs strongly suggest that mRNA-like ncRNAs play important roles in organ development and cell differentiation.
In this study, we have identified 136 putative mRNA-like ncRNAs, which lack significant protein-coding potential, in a collection of more than 10 000 unique Drosophila cDNAs. We also showed that 35 of these ncRNA candidates are expressed during embryonic development, 27 of them in specific tissues. These results suggest that mRNA-like ncRNAs are important components of the Drosophila transcriptome and are involved in a variety of developmental events.
BLAST analyses indicate that the non-coding transcripts identified in this study are not evolutionarily conserved among multicellular organisms (Table 1). It might be argued that these results imply that the predicted non-coding transcripts would not be functionally important and are actually “junk” RNAs. However, a growing number of studies have shown that mRNA-like ncRNAs can play critical roles in a variety of biological events, despite poor conservation of their nucleotide sequences. For example, the roX1 and roX2 RNAs in Drosophila and the Xist RNA in mammals play critical roles in dosage compensation, in which non-coding RNAs specifically recognize the X chromosome and regulate its transcription (Lee et al. 1996; Herzing et al. 1997; Meller & Rattner 2002). Although the roX RNAs and Xist are essential to particular species, they are not evolutionarily conserved even within insects and vertebrates, respectively: the Anopheles genome does not possess any sequences similar to roX1 and roX2, and a counterpart of Xist is not found in pufferfish (data not shown). Another example of poor conservation in ncRNAs is tRNA. Although tRNA is well conserved in its secondary (clover-leaf) and tertiary (L-shape) structure, its primary structure is highly divergent among species. The fact that primary structure of the mRNA-like ncRNA candidates is not conserved between Drosophila and mammals suggests that efforts to identify conserved ncRNA sequences or RNA motifs may be less fruitful than structural analysis.
In general, it is difficult to confirm whether a given transcript is translated in living cells. To assess the protein-coding probability of the mRNA-like ncRNA candidates, we compared these transcripts between species and found that the ORFs of the non-coding transcripts are generally poorly conserved during evolution (Table S4). This method is more rapid and convenient than experimental procedures. For example, previous reports have used in vitro translation and the production of specific antibodies against predicted small peptides (Lanz et al. 1999; French et al. 2001; Ji et al. 2003). However, in vitro translation may not reflect translational regulation in vivo, and production of a specific antibody is time-consuming and not always successful. Comparative genomics cannot establish whether a transcript is translated, but it is an easy way to identify evolutionarily conserved small ORFs. Although comparative genomics requires nearly complete genome sequences and is thus suitable only for model organisms, it can be a powerful tool for analyzing novel RNA molecules, as we have demonstrated in this study.
The number of Drosophila ncRNA candidates identified by our in silico screening (136 transcripts; Fig. 1) is much smaller than those in mouse and human (4280 and 5481, respectively) (Numata et al. 2003; Ota et al. 2004). This difference in numbers between the taxa may reflect a difference in the amount of intergenic region in their genomes. The number of protein-coding genes in Drosophila (∼14 000) (Yandell et al. 2005) is broadly similar to that in mammals (25 000–30 000) (Waterston et al. 2002; I.H.G.S. Consortium 2004), whereas the Drosophila genome is approximately 1.76 × 10⁸ bp (Celniker et al. 2002; Bennett et al. 2003), less than one-tenth the size of mammalian genomes. Drosophila thus has a more compact genome than mammals, whose genomes are composed mainly of intergenic non-coding regions. Mattick (2004) has pointed out that the complexity of organisms and the amount of non-coding regions in their genomes show significant correlation; that is, higher eukaryotes like mammals have disproportionately large genomes in comparison with simpler organisms. Therefore, the difference in the ncRNA numbers may also reflect the complexity of these organisms. Another potentially important factor is that the public Drosophila cDNA collection does not represent the whole genome. Detailed annotation by the Genome Project revealed that the Drosophila genome contains approximately 14 000 protein-coding genes (Yandell et al. 2005). Thus, it should be noted that only about 84% of the Drosophila genes are included in the cDNA collection (11 691 cDNAs), and further analysis of ESTs should uncover more mRNA-like ncRNAs.
The current study has shown that expression of many mRNA-like ncRNA candidates is spatially and temporally restricted during Drosophila embryogenesis (Figs 2 and 4). These highly regulated expression patterns suggest that the majority of MREs are involved in developmental events, such as cell differentiation and organogenesis. The tissue-specific ncRNA candidates are predominantly expressed in the central and peripheral nervous system (Fig. 2; Table 2), suggesting that the major site of action for the mRNA-like ncRNAs is the nervous system. The only previously known example of an mRNA-like ncRNA molecule that is involved in Drosophila neurogenesis is the bereft RNA (Hardiman et al. 2002), which is required for proper development of external sensory organs and is downstream of the selector gene cut. The physiological function of bereft RNA led us to predict that other ncRNAs, including our candidates, may also be involved in neural development. If the complexity of organisms (see above) and the number of mRNA-like ncRNAs correlate with each other, complexity may be partly attributable to nervous system-specific ncRNAs.
Although little is known about the biochemical activities of mRNA-like ncRNAs, previous studies on other ncRNA species have shed light on common features of ncRNAs. For example, a large number of small ncRNAs (e.g. miRNAs and snoRNAs) specifically recognize target RNA and function as guide molecules (Ni et al. 1997; Lau et al. 2001; Lee & Ambros 2001). On the other hand, several higher-molecular-weight RNAs, including SRP RNA and ribosomal RNA, function as molecular scaffolds of ribonucleoprotein (RNP) complexes (Ban et al. 1999; Clemons et al. 1999; Kuglstatter et al. 2002). We speculate that mRNA-like ncRNAs in Drosophila have similar biochemical activities, perhaps functioning as guide molecules that recognize post-transcriptionally regulated mRNAs. Alternatively, mRNA-like ncRNAs may be constituents of RNP complexes and orchestrate the activities of protein subunits during development. Further analysis of their physiological functions, with the aid of Drosophila genetics, should elucidate the roles of these RNA molecules in vivo.
In silico screening for Drosophila mRNA-like ncRNAs
Computational analysis of putative mRNA-like ncRNAs
For simulation analysis, a randomized control sequence was generated for each candidate transcript using a Perl script that left base composition unchanged. We generated control sequences three times, and the profiles of the three sets were very similar to each other. The length of every possible ORF in the randomized sequences was calculated, and the average length of the longest ORF of each transcript was compared with that of the mRNA-like ncRNA candidates.
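The randomization control can be illustrated with the following Python sketch, a stand-in for the Perl script mentioned above, which is not reproduced here. It assumes that "without any changes in base composition" means a simple permutation of the bases of each transcript. The function names, the seed argument and the pluggable `orf_length` callable are all hypothetical.

```python
import random
from statistics import mean
from typing import Callable, Dict, List


def shuffle_preserving_composition(seq: str, rng: random.Random) -> str:
    """Randomly permute the bases of a sequence; base counts are unchanged."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def randomized_controls(
    candidates: Dict[str, str], replicates: int = 3, seed: int = 0
) -> List[Dict[str, str]]:
    """Independent shuffled control sets, one shuffled sequence per candidate."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [
        {name: shuffle_preserving_composition(seq, rng)
         for name, seq in candidates.items()}
        for _ in range(replicates)
    ]


def mean_longest_orf(
    sequences: Dict[str, str], orf_length: Callable[[str], int]
) -> float:
    """Average of the per-sequence longest ORF, given any ORF-length function."""
    return mean(orf_length(seq) for seq in sequences.values())
```

Any function that returns the longest ORF of a sequence can be passed as `orf_length`, so the same comparison can be run on the candidates and on each control set.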
cDNA clones in this study were obtained from the National Institute of Genetics (Mishima, Japan) or the Drosophila Genomics Resource Center (Bloomington, IN, USA). As probes for in situ hybridization, ∼500-bp DNA fragments of each cDNA clone were amplified by PCR with specific primers (Table S1) and inserted into pCR-BluntII-TOPO (Invitrogen). After treatment with appropriate restriction enzymes, strand-specific RNA probes were prepared by in vitro transcription with T7 or T3 RNA polymerase (EPICENTRE) and DIG-UTP (Roche). Probes for several cDNAs were transcribed from the original clones in the pOT2 vector, with T7 or SP6 RNA polymerase (EPICENTRE). Pretreatment of embryos prior to in situ hybridization was performed as previously described (Nagaso et al. 2001), and all processes except color development were carried out by InsituPro (INTAVIS AG). Parameters of automatic hybridization processes are summarized in Table S2.
We thank Jonathan L. Tupy, Adina M. Bailey and Gerald M. Rubin for communication of unpublished results. We also thank Ian Smith, Kunio Inoue and Yoshiko Hashimoto for critical reading of the manuscript, and the Genetic Strain Research Center at the National Institute of Genetics (Mishima, Japan) for kindly supplying Drosophila cDNAs. This study was supported in part by Grants-in-Aid for the 21st Century Center of Excellence Program from the Ministry of Education, Culture, Sports, Science and Technology (MEXT) of Japan, as well as Grants-in-Aid for Scientific Research on Priority Areas and for Encouragement of Young Scientists from MEXT to Y.K. This study was also supported by a Grant for Young Scientists from the Foundation for Nara Institute of Science and Technology to S.I., T.K. and Y.K.