EJC, exon junction complex; mRNP, mRNA protein particle; NMD, nonsense-mediated decay; ORF, open reading frame; TREX, transcription export; UTR, untranslated region; 3UI, 3′-UTR intron; 5UI, 5′-UTR intron.
Although introns in 5′- and 3′-untranslated regions (UTRs) are found in many protein-coding genes, they are rarely considered distinctive entities with specific functions. Indeed, mammalian transcripts with 3′-UTR introns are often assumed to be nonfunctional because they are subject to elimination by nonsense-mediated decay (NMD). Nonetheless, recent findings indicate that 5′- and 3′-UTR intron status is of significant functional consequence for the regulation of mammalian genes. Therefore, these features should no longer be ignored.
A clearly appreciated role for introns in higher organisms is to allow for alternative splicing, which permits a single gene to encode many different proteins. Less widely appreciated, however, is that the presence of an intron and the act of its removal by the spliceosome can influence almost every step in gene expression from transcription and polyadenylation to mRNA export, localization, translation, and decay 1, 2. These influences modulate both the levels and localization of expressed proteins. While ∼90% of human introns occur within protein-coding regions (open reading frames; ORFs), many also reside in untranslated regions (UTRs). Approximately 35% of human 5′-UTRs 3, and between ∼6% (NCBI's Reference Sequence; RefSeq) and ∼16% (Vertebrate Genome Annotation; Vega) of human 3′-UTRs are annotated as harboring introns. Yet despite their prevalence, introns in UTRs are rarely considered as distinctive entities with specific regulatory functions. Indeed, until very recently the prevailing view of 5′-UTR introns (5UIs) was that they are only special insofar as they are proximal to the 5′ end of the transcript. Further, a common view of 3′-UTR introns (3UIs) is that they are signatures of nonfunctional transcripts arising solely from genomic noise (e.g. pseudogenes, transposons), genetic mutation, or errors in splicing. This view stems from the observation that mammalian mRNAs with an intron excision site >55 nucleotides downstream of a termination codon are subject to degradation by the nonsense-mediated decay (NMD) pathway 4–8. Reflecting the widespread view that NMD is restricted to mRNAs encoding inappropriately truncated proteins, NCBI's RefSeq database routinely excludes most 3UI-containing transcripts from its annotated coding transcripts 9. Nonetheless, recent evidence clearly indicates that 5UIs and 3UIs do have important and unique roles in the regulation of gene expression that should not be overlooked.
Below, we describe evidence that the presence or absence of a 5UI has significant consequences for both mRNA nuclear export and cytoplasmic mRNA metabolism, and that 3UIs have multiple roles in modulating normal protein expression.
All introns can influence gene expression regardless of their position relative to the coding region because they alter the protein makeup of the mRNA protein particle (mRNP). One set of splicing-dependent mRNP proteins is the exon junction complex (EJC), deposited by the spliceosome ∼24 nts upstream of exon junctions on spliced mammalian mRNAs 10, 11. This multiprotein complex remains tightly bound to the mRNA until the first round of translation, when EJCs within the coding region are displaced by ribosomes as they translocate across the message (Fig. 1) 12. Until then, the EJC core serves both as a molecular marker of prior intron position, and as a binding platform for peripheral proteins. These peripheral factors associate transiently with the core and help regulate the subcellular localization, translation, and decay of the transcript 2, 13–15.
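The positional logic of EJC deposition and removal can be made concrete with a minimal sketch. It assumes that positions are expressed in spliced-mRNA coordinates, that the EJC sits exactly 24 nt upstream of each junction (the text gives this only as an approximation), and that every EJC upstream of the end of the stop codon is cleared during the first round of translation. The function names are hypothetical and chosen only for illustration.

```python
EJC_OFFSET_NT = 24  # approximate deposition distance upstream of a junction (from the text)


def exon_junctions(exon_lengths):
    """Return exon-exon junction positions in spliced-mRNA coordinates.

    A junction at position p has exactly p nucleotides upstream of it.
    """
    junctions = []
    total = 0
    for length in exon_lengths[:-1]:  # the last exon ends the mRNA, not a junction
        total += length
        junctions.append(total)
    return junctions


def ejc_positions(exon_lengths, offset=EJC_OFFSET_NT):
    """Approximate EJC deposition sites, one per exon junction."""
    return [max(j - offset, 0) for j in exon_junctions(exon_lengths)]


def ejcs_after_first_round(exon_lengths, stop_codon_end, offset=EJC_OFFSET_NT):
    """EJCs expected to remain after one ribosome has traversed the ORF.

    Assumption: any EJC located upstream of the end of the stop codon is
    displaced; only EJCs downstream of it (in the 3'-UTR) persist.
    """
    return [p for p in ejc_positions(exon_lengths, offset) if p > stop_codon_end]


if __name__ == "__main__":
    exons = [200, 150, 300]  # illustrative exon lengths, not real data
    print(ejc_positions(exons))
    print(ejcs_after_first_round(exons, stop_codon_end=250))
```

In this toy transcript the second junction lies downstream of the stop codon, so its EJC is the only one predicted to survive the first round of translation.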
Also deposited on mRNAs during transcription and splicing are the transcription export (TREX) complex and SR proteins. In mammals, the TREX complex is recruited primarily to the 5′ end of transcripts through cooperative action of the nuclear cap-binding complex and the spliceosome 16, 17. Once bound to the mRNA, the TREX complex promotes nuclear export of fully processed transcripts through the nuclear pore by direct interactions between the TREX component, Aly, and the nuclear export factor, TAP-p15 18, 19. SR proteins are best known for their roles in exon definition and as alternative splicing regulators. However, they are also subject to splicing-dependent dephosphorylation, which promotes their tight association with the spliced mRNA. As mRNP components, SR proteins can enhance nucleocytoplasmic export, translation, and decay of their bound mRNAs 20–22. Thus, as elaborated below, one means by which 5UIs and 3UIs influence gene expression is by promoting the loading of mRNP proteins with downstream functional consequences.
Initial models suggested that 5UIs evolved under nearly neutral genetic selection, implying that they have no specific function 23. If this were the case, one would expect 5UIs to be equally distributed among transcripts of all functional classes. Recent analyses, however, have revealed that genes having or lacking 5UIs fall into distinct functional classes, at least in the human and rat genomes. Whereas genes with regulatory functions are enriched for 5UIs, genes encoding proteins targeted to the endoplasmic reticulum (ER) or mitochondria are significantly depleted of such introns 3, 24. When 5UIs are present, they are necessarily the most 5′ proximal introns in a transcript, and 5′ proximal introns have a disproportionate role in regulating transcription, mRNA export, and translation 16, 25–28. However, 5′ proximity alone cannot explain the functional distribution of transcripts that do or do not contain 5UIs. Importantly, transcripts possessing only coding-region introns, and in which the first intron has the same proximity to the transcription start site as a 5UI, do not display the same functional distribution as 5UI-containing transcripts 3.
The enrichment of 5UIs in regulatory genes could reflect their tendency to have more transcription factor binding sites, which are often located within the first intron 26. In addition, deposition of splicing-dependent mRNP components as close as possible to the 5′ end of the mRNA could play a positive role in facilitating rapid export and translation of the newly made mRNAs 28.
On the other hand, some transcripts have evolved to exclude introns from their 5′-UTRs because this allows them to use an alternate mode of nuclear export, the ALREX mRNA export pathway. Unlike the canonical TREX-dependent nuclear export pathway, the ALREX pathway does not require splicing 29. Instead, ALREX facilitates mRNA export via a specific RNA sequence element located within the 5′ end of the ORF 29, 30. This sequence element is particularly prominent in transcripts encoding ER and mitochondrial-targeted proteins, the same functional class that is depleted of 5UIs. The current model is that when ALREX elements are present, their position relative to the first intron dictates the method of mRNA export. For transcripts lacking a 5UI, if an ALREX sequence is present at the 5′ end of the ORF, it is likely to be upstream of the first intron (which would be in the ORF). Thus, the ALREX pathway is used to export the mRNA from the nucleus. On the other hand, for transcripts containing a 5UI, the first intron is necessarily upstream of the ORF. These mRNAs are exported by the canonical TREX pathway regardless of whether an ALREX element is present in the ORF (Fig. 2). In support of this model, nucleotide sequences near the 5′ end of the ORF strongly correlate with 5UI status, and only sequences derived from 5UI-lacking transcripts can support mRNA export in the absence of splicing 30. Although the ALREX sequence was first identified in ER- and mitochondrial-targeted genes, it can still function as a splicing-independent export element in other sequence contexts 30. Therefore the ALREX pathway is likely relevant beyond ER and mitochondrial genes, and 5UI status is likely important for regulating export of additional classes of transcripts.
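The current model for export pathway choice can be summarized as a small decision rule. The sketch below is only an illustration of that model, not a validated predictor: positions are in spliced-mRNA coordinates, the function and parameter names are hypothetical, and the final branch (an ALREX element lying downstream of a first intron within the ORF) is an extrapolation from the statement that the element's position relative to the first intron dictates the pathway. Transcripts without an ALREX element and without a 5UI are outside the scope of the model and are reported as unspecified.

```python
def predicted_export_pathway(orf_start, first_junction=None, alrex_pos=None):
    """Predict the nuclear export pathway under the positional model in the text.

    orf_start      -- position of the start codon
    first_junction -- position of the 5'-most exon junction, or None if unspliced
    alrex_pos      -- position of an ALREX element near the 5' end of the ORF,
                      or None if no such element is present
    """
    has_5ui = first_junction is not None and first_junction < orf_start
    if has_5ui:
        # A 5UI is necessarily upstream of the ORF: TREX is used regardless
        # of whether an ALREX element is present.
        return "TREX"
    if alrex_pos is None:
        # The model only addresses transcripts that carry an ALREX element.
        return "unspecified"
    if first_junction is None or alrex_pos < first_junction:
        # The ALREX element lies upstream of the first intron (or there is none).
        return "ALREX"
    # Extrapolation: the first intron lies upstream of the ALREX element.
    return "TREX"


if __name__ == "__main__":
    print(predicted_export_pathway(orf_start=100, first_junction=50, alrex_pos=120))
    print(predicted_export_pathway(orf_start=100, first_junction=300, alrex_pos=120))
```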
Why do alternate export pathways exist and how do they contribute to overall gene expression? Currently available data suggest that the choice of nuclear export pathway can have downstream functional consequences. For example, a model mRNA exported by the TREX pathway is initially sequestered in stress granules, whereas an almost identical mRNA targeted to the ALREX pathway is not 29. Thus mRNAs exported by the ALREX pathway may be more readily available for immediate translation under conditions of stress. Another intriguing possibility is that alternative promoter usage leading to inclusion or exclusion of a splicing event upstream of an ALREX element can allow switching between the two export pathways, thereby regulating subsequent mRNA expression. Expressed sequence tags (ESTs), cap analysis of gene expression (CAGE), and RNA-seq data indicate that alternate promoter use is widespread in higher eukaryotes; 30–50% of all human and mouse genes have been reported to contain alternate promoters 31, 32. To a lesser extent, 5UIs can be alternatively retained in the mature mRNA, rather than spliced out 33, 34. It is currently unknown whether mRNAs containing ALREX elements are enriched for alternate promoter use or alternate intron retention, but this is clearly an interesting avenue for further research.
In mammals, when an mRNA enters the cytoplasm and is translated, the nature of the translation termination event determines whether the transcript will persist and continue to produce protein, or become degraded by the NMD pathway 35. NMD occurs when Upf1, which is bound to the terminating ribosome, interacts with Upf2, a peripheral EJC protein 36. EJCs bound within ORFs are removed by the ribosome during translation, but an EJC downstream of the termination codon (i.e. in the 3′-UTR) should persist and stimulate NMD (Fig. 1) 15. Thus, NMD strictly requires translation of the mRNA target. Current models suggest that the mRNA is translated only once before it is destroyed, producing a single molecule of protein per transcript 37. Upf1 and Upf2 can also interact on a transcript and stimulate NMD when the transcript has a particularly long 3′-UTR 38. This additional form of NMD is presumably splicing-independent, and is discussed elsewhere 7, 38.
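The splicing-dependent form of NMD is commonly predicted with the positional rule stated in the introduction: an exon junction more than 55 nt downstream of the termination codon marks a transcript as a likely NMD substrate. The following minimal sketch applies that rule. It assumes spliced-mRNA coordinates and measures the distance from the end of the stop codon, which is one reasonable reading of the rule rather than a convention confirmed by the text. The function names are hypothetical, and the splicing-independent, long 3′-UTR form of NMD is not modeled.

```python
NMD_THRESHOLD_NT = 55  # distance threshold quoted in the text


def nmd_triggering_junctions(stop_codon_end, junctions, threshold=NMD_THRESHOLD_NT):
    """Return the exon junctions lying more than `threshold` nt downstream
    of the end of the termination codon."""
    return [j for j in junctions if j - stop_codon_end > threshold]


def is_predicted_nmd_substrate(stop_codon_end, junctions, threshold=NMD_THRESHOLD_NT):
    """True if at least one junction satisfies the positional rule."""
    return bool(nmd_triggering_junctions(stop_codon_end, junctions, threshold))


if __name__ == "__main__":
    # Illustrative coordinates only: a junction 154 nt past the stop codon
    # satisfies the rule, whereas one 40 nt past it does not.
    print(is_predicted_nmd_substrate(1000, [300, 1154]))
    print(is_predicted_nmd_substrate(1000, [300, 1040]))
```

Junctions upstream of the stop codon never count, consistent with the removal of ORF-bound EJCs by the translating ribosome.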
NMD is perfectly suited to reduce the abundance of several classes of nonfunctional mRNAs. First, mutations or aberrant pre-mRNA splicing frequently introduce premature termination codons (PTCs) upstream of an EJC deposition site. NMD thus prevents production of potentially deleterious truncated proteins 39. Second, NMD clearly serves to dampen the expression of nonfunctional transcripts arising from pseudogenes, expressed transposons, or integrated retroviruses, which frequently contain termination codons upstream of introns 40. Finally, during programmed gene rearrangements in the T-cell receptor and immunoglobulin genes, unproductively rearranged alleles generate termination codons upstream of introns, and the resulting mRNAs are degraded by NMD 41.
Because transcripts containing 3UIs can be produced by mistake and because NMD degrades such transcripts, it is often assumed that all 3UI-containing transcripts are nonfunctional. Therefore, in an effort to accurately represent only functional mRNAs in the transcriptome, the NCBI Reference Sequence (RefSeq) database has suppressed the majority of 3UI-containing “coding” sequences (accession prefix NM_) and reassigned them “noncoding” accession numbers (accession prefix NR_ or XR_). As of April 2011, 846 human transcripts that were considered well-supported by RefSeq had been designated “noncoding” solely because they are predicted NMD substrates. However, the designation of these transcripts as noncoding is not a trivial matter: mRNA prediction algorithms based on ESTs estimate that 35% of all human alternatively spliced isoforms contain 3UIs 42. Are all of these mRNAs inconsequential for protein production? Or can an mRNA predicted to be destroyed shortly after translation still be considered real and functionally relevant?
In addition to facilitating elimination of the nonfunctional mRNAs described above, significant evidence exists that some 3UIs serve to modulate normal gene expression 40, 43–49. Three classes of 3UI-containing mRNAs are both functional and subject to NMD (Fig. 3). First are mRNAs with short ORFs in the 5′-UTR (upstream ORFs; uORFs). If uORF translation occurs prior to translation of the main ORF, exon junctions within the main ORF are effectively in the 3′-UTR of the uORF, and this can elicit NMD. In this way, translation of uORFs can serve to modulate mRNA levels and thus expression of the main ORF 40. The second class consists of mRNAs in which a termination codon upstream of an exon junction is “intentionally” introduced by alternative splicing. Such a splicing event can lead to mRNA down-regulation through a process known as alternative-splicing linked to NMD (AS-NMD) 50. The third class consists of mRNAs that are constitutively spliced within the 3′-UTR 51. These transcripts are expected to be degraded every time they are translated, unless the NMD pathway is inhibited.
Proteins produced from such 3UI-containing transcripts have been detected in cells. In 2004, Hillman et al. 52 analyzed 1,363 human protein sequences deposited into the Swiss-Prot database and found 107 entries (7.9% of those analyzed) that were derived from transcripts that are apparently subject to NMD. More recently an analysis of mass spectrometry data from the Global Proteome Machine 53 and PeptideAtlas 54 repositories reached the same conclusion: 3UI-containing transcripts can indeed express protein 55. These results suggest that either the peptides detected in the above proteomics studies are the products of a single round of translation 37, or some endogenous human NMD substrates can undergo multiple rounds of translation prior to being decayed. Consistent with the latter idea, NMD in budding yeast can occur during any round of translation, not just the first 56.
Given that mRNAs containing 3UIs are surprisingly common (an estimated 35% of alternatively spliced isoforms 42 plus those with uORFs and constitutive 3UIs; see above), the challenge is to distinguish functional 3UIs from those representing genomic noise or cellular errors 57. Two potential indicators of function are conservation and tissue-specific expression. These concepts can be applied to all three classes of 3UI-containing transcripts described above (Fig. 3).
The first class of 3UI-containing transcripts is those with uORFs. Sequence analysis has revealed that approximately 35% of human genes harbor uORFs; of these, 38% are conserved among human, mouse, and rat 58. Further, ribosome profiling has recently shown that 26% of all translationally active ORFs in mouse embryonic stem cells are actually uORFs 59. Thus uORFs are prevalent, translated, and often conserved, indicating that many are functional. However, uORFs may have functions besides NMD 60, 61, so the presence of a conserved uORF does not necessarily mean an mRNA is regulated by NMD. For example, it is possible that some ribosomes translating a uORF either fail to recognize the uORF termination codon or reinitiate on the main ORF downstream 62. Functional studies will be necessary to determine how many mRNAs are regulated via recruitment of NMD factors upon uORF termination.
The second class is AS-NMD transcripts. Among alternative splicing events conserved between human and mouse, approximately 21% introduce termination codons upstream of introns 57, 63. This estimate, based on traditional transcriptomics, was recently re-examined by an in-depth analysis of the 309 protein-coding genes within the ENCODE pilot phase regions of the human genome 64. By including next-generation sequencing and RT-PCR data, that study produced a highly reliable dataset of 162 conserved alternative splicing events. Of these, 27 (17%) introduce 3UIs. Thus, alternative splicing often leads to 3UI-containing transcripts that are conserved and therefore potentially functional.
To address the question of tissue-specificity of 3UI-containing transcripts, Pan et al. 57 undertook a comprehensive analysis of alternatively spliced 3UI-containing isoforms across ten different mouse tissues. Using exon and splice junction arrays to examine inclusion and exclusion of cassette exons, they found little evidence of tissue-specific expression of NMD isoforms. Nonetheless, they also concluded that NMD isoforms are expressed at low levels. Therefore, it is possible that the splicing arrays used in that study, though state-of-the-art at the time, did not provide enough sensitivity to detect expression level differences in rare 3UI-containing transcripts among tissues. In addition, splicing arrays can only measure the specific exon inclusion/exclusion events represented on the array, which are only a small percentage of total alternative splicing events occurring in cells 34. The 2010 Illumina BodyMap deep sequencing project (http://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-513 accessed on January 11, 2012) provided both enhanced sensitivity and a broader scope of alternate splicing events to address questions of tissue-specific splicing. Our analysis here of these data clearly shows tissue-specific expression of alternatively spliced 3UI-containing transcripts (Fig. 4). For example, CALCA mRNA (Fig. 4; outlined in red) contains six exons and can be alternatively processed to produce distinct mRNAs encoding calcitonin or calcitonin gene-related peptide. The mRNA containing all six exons harbors a stop codon in exon 5, 154 nt upstream of intron 5, and is therefore a potential NMD substrate. This 3UI-containing form is the only CALCA mRNA expressed in the human brain. By contrast, a non-NMD splice-form containing only exons 1–4 predominates in the thyroid. This high degree of variation strongly suggests that there is a functional consequence of including a 3UI in the brain, but not the thyroid. 
This possibility is consistent with studies showing that the protein encoded by the CALCA gene performs different functions in these two tissues 65.
The third class is constitutively spliced 3UI-containing transcripts. In the Illumina BodyMap deep sequencing data, we have identified 75 human genes for which the only detectable transcript(s) contain 3UIs more than 55 nts downstream of the translation termination codon, and these transcripts are expressed in a tissue-specific manner (Table 1). Previous analyses have identified 152 transcripts containing constitutive 3UIs that are conserved among human, rat, and mouse 51. Interestingly, constitutive 3UI-containing transcripts that are conserved are particularly enriched in brain, testes, and hematopoietic cells 51. We do not yet know the complete functional significance of these enrichments, but as described below, NMD is known to be involved in regulating developmental programs in both neurons and hematopoietic cells.
Numerous studies have taken the experimental approach of inhibiting NMD and examining the resulting changes in mRNA levels and splicing patterns 40, 43–49, 57, 66. By combining these studies, close to 1,000 functional 3UI-containing transcripts have been found to exhibit either increased expression or alternate exon usage upon NMD inhibition. The number and identity of these mRNAs vary considerably by cell type, so the list will likely grow if more cell types are examined. Of course, it is still uncertain from those studies which changes in expression and splicing upon NMD inhibition were direct, but NMD inhibition did affect a much higher proportion of 3UI-containing transcripts than non-3UI-containing transcripts 48.
The most dramatic functional enrichment among NMD-affected transcripts is for those encoding RNA-binding proteins 45, 66. Within this class of genes, exons that introduce stop codons upstream of introns are extremely well-conserved 44, 67. The introns to either side of these exons are also well-conserved, suggesting the presence of regulatory elements affecting their alternative splicing. This conservation is explained by the fact that many RNA-binding proteins use AS-NMD to regulate their own expression homeostatically. That is, as an RNA-binding protein increases in abundance, it increasingly binds its own pre-mRNA and facilitates production of a 3UI-containing form subject to NMD, thereby maintaining protein homeostasis 44, 67, 68. Proteins undergoing this type of regulation include ribosomal proteins 42, 69–71, core spliceosomal proteins 66, 72, and alternative splicing regulators such as hnRNP and SR proteins 45, 67. In fact, every one of the 11 human SR proteins has a 3UI isoform and regulates its own production by AS-NMD 67.
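The buffering effect of this negative feedback loop can be illustrated with a toy simulation. Everything in the sketch is an assumption made for illustration: the saturating form of the feedback, the parameter values (arbitrary units), the simple Euler integration, and the simplification that the NMD-targeted isoform yields no protein at all. It is not a model taken from the cited studies.

```python
def simulate_autoregulation(synthesis=1.0, decay=0.1, k_half=2.0,
                            feedback=True, dt=0.01, steps=20000):
    """Integrate protein level over time for a gene under AS-NMD autoregulation.

    synthesis -- pre-mRNA output converted to protein when spliced productively
    decay     -- first-order protein decay rate
    k_half    -- protein level at which half of the pre-mRNA is diverted
                 to the 3UI-containing (NMD-targeted) isoform
    """
    protein = 0.0
    for _ in range(steps):
        # Fraction of pre-mRNA spliced to the NMD-targeted isoform rises
        # with protein abundance (hypothetical saturating function).
        nmd_fraction = protein / (k_half + protein) if feedback else 0.0
        productive = synthesis * (1.0 - nmd_fraction)
        protein += dt * (productive - decay * protein)
    return protein


if __name__ == "__main__":
    for label, fb in (("no feedback", False), ("AS-NMD feedback", True)):
        base = simulate_autoregulation(synthesis=1.0, feedback=fb)
        doubled = simulate_autoregulation(synthesis=2.0, feedback=fb)
        print(label, round(base, 2), round(doubled, 2), round(doubled / base, 2))
```

Under these assumptions, doubling the synthesis rate doubles the steady-state protein level without feedback but raises it by a smaller factor when the protein diverts its own pre-mRNA to the NMD-targeted isoform, which is the sense in which the loop is homeostatic.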
Also enriched among NMD targets are mRNAs involved in development and differentiation 45, 46. Consistent with this, NMD-deficient mice fail to complete embryogenesis, with the most prominent defects observed in heart, brain, and hematopoietic development 45, 47, 73, 74. The neural development program incorporates NMD in at least two important ways, both with widespread effects. First, as neurons differentiate, they down-regulate expression of Ptbp1, an alternative splicing factor. One function of Ptbp1 is to alter splicing of transcripts encoding another splicing factor, Ptbp2, such that a 3UI-containing mRNA isoform is produced. Thus, as neurons differentiate and Ptbp1 levels decrease, the non-3UI-containing form of PTBP2 mRNA is produced. As a result, Ptbp2 protein levels increase, leading to a regulated change in alternative splicing of its many targets, many of which are involved in neuronal differentiation 75. Some of these Ptbp2 splicing targets are themselves NMD substrates. For example, PSD-95 (also known as DLG4, SAP90), encoding a critical protein in synaptic function, is transcribed throughout neuronal development. However, during early development it is spliced to include a 3UI and the mRNA is degraded. Psd-95 protein is not detected until later in development, when the splicing pattern shifts in favor of the non-3UI-containing form 76.
NMD is additionally linked to the neural differentiation program through miR-128, a microRNA with enriched expression in the brain. Production of miR-128 increases during neural development and one direct target of miR-128 is UPF1 mRNA. Thus both Upf1 levels and NMD efficiency decrease as neurons differentiate. As a result, hundreds of transcripts that would otherwise be subject to NMD are up-regulated, and most of these encode proteins important for neural function 46.
3UI-containing transcripts are also enriched for mRNAs involved in amino acid metabolism, starvation, ER stress, and hypoxia 40, 47, 49. These enrichments led to the discovery that NMD is inhibited during stress through a mechanism involving phosphorylation of translation initiation factor, eIF2α. As a result, 3UI-containing stress-response transcripts are stabilized, thus promoting stress survival 49.
The examples above highlight several mechanistically distinct ways that 3UIs can be integrated into gene regulatory pathways. Other possibilities also exist, including regulation of EJC deposition, translational regulation, and regulation of other components of the NMD machinery. All such mechanisms are theoretically subject to cell-type, condition-specific, and/or transcript-dependent control. Therefore, if we are to appreciate the full spectrum of these possibilities, researchers need to become more cognizant of the many protein-coding transcripts that contain 3UIs.
As detailed above, a wide range of evidence now indicates that whether or not UTR introns are present can significantly affect gene expression. It is therefore important that researchers be able to identify such introns so that they may study their impact on a particular pathway or gene of interest. For this purpose, a wide variety of sources is available, including RefSeq, Vega, AceView, UCSC Known Genes, H-Invitational, and ENCODE 9, 77–81. Each of these is different in terms of how it approaches the balance between specificity and sensitivity – that is, how it limits its annotations to mRNAs that are functionally relevant (specificity), while at the same time ensuring that as many real mRNAs as possible are represented (sensitivity) 82.
RefSeq achieves high specificity: mRNAs represented in RefSeq have a very high likelihood of being real. Therefore, for 5UI identification, RefSeq is a good starting point. Intronic regions and UTR boundaries are generally well annotated and we have generated a list of all human RefSeq mRNAs with 5UIs, which is publicly available 3. However, RefSeq has relatively lower sensitivity and lacks many real transcripts. In particular, transcripts that are scarce or only expressed in a specific cellular context may not be supported by enough sequence-based evidence to be included in RefSeq. Our analysis thus far has found no specific bias or trend toward inclusion or exclusion of 5UIs in RefSeq. However, in the case of alternate promoter usage, which could lead to inclusion or exclusion of a 5UI, one or more alternate forms could be missing. Therefore, because different groups use different data sources and different methodologies, it is worthwhile to query multiple annotations (listed above) in search of all potential 5′-UTRs for a given transcript.
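Once exon boundaries and ORF coordinates have been retrieved from an annotation, assigning each intron to the 5′-UTR, ORF, or 3′-UTR is a simple positional comparison. The sketch below assumes spliced-mRNA coordinates in which a junction at position p has p nucleotides upstream of it, `orf_start` is the number of nucleotides preceding the start codon, and `orf_end` is the position immediately after the stop codon. The function name and coordinate convention are illustrative assumptions, not those of any particular annotation resource.

```python
def classify_junctions(junctions, orf_start, orf_end):
    """Label each exon junction by the transcript region in which it falls.

    Returns a list of "5UI", "ORF", or "3UI", one label per junction.
    """
    labels = []
    for j in junctions:
        if j <= orf_start:
            labels.append("5UI")   # intron was excised from the 5'-UTR
        elif j >= orf_end:
            labels.append("3UI")   # intron was excised from the 3'-UTR
        else:
            labels.append("ORF")   # intron interrupted the coding region
    return labels


if __name__ == "__main__":
    # Illustrative transcript: 120-nt 5'-UTR, stop codon ending at position 850.
    print(classify_junctions([80, 400, 900], orf_start=120, orf_end=850))
```

Running the same classification over the transcript models from several annotation sources is one way to see where they disagree on 5UI or 3UI status for a gene of interest.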
Because NMD substrates are inherently unstable, sequence evidence for them can be particularly limited. Therefore, sensitivity is even more of an issue for 3UI transcripts, and multiple data sources should always be used to search for them. In addition to the annotation projects listed above, most primary data from deep sequencing studies are publicly available and can be accessed directly. This is an extremely powerful way to look for transcripts that are not yet in the composite databases, but can require more in-depth bioinformatics expertise and effort on the part of the user. Of particular use are sequencing data from experimental conditions that are enriched for 3UI transcripts. Experiments listed above that knock down or inhibit the NMD machinery are a good source. In addition, the ENCODE project now includes RNA-seq data from human nuclear RNA, which are publicly available (http://genome.ucsc.edu/ENCODE/). We expect these data to be enriched for 3UI transcripts, as NMD occurs exclusively in the cytoplasm.
In addition to their under-representation due to scarcity, 3UI transcripts are also actively suppressed from some annotation pipelines because they are considered nonfunctional. As discussed above, RefSeq's current policy is to designate most 3UI-containing mRNAs as “noncoding” even when a transcript with protein-coding potential is well-supported. The only exceptions in RefSeq are genes for which all available transcripts exhibit 3UIs or those few that have been experimentally verified to produce protein 9. In genome browsers, ORFs are not displayed for RNAs designated as “noncoding.” This may lead researchers into falsely believing that 3UI transcripts contain no ORF and have no protein coding potential. Therefore, when using RefSeq to identify 3UIs, noncoding transcripts must be carefully examined. If these have been designated as noncoding due to predicted NMD targeting, this will be annotated as a “misc_feature” on the transcript record.
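When screening RefSeq records in bulk, the accession prefix gives a quick first indication of how a transcript has been designated. The sketch below uses only the prefixes named in this article (NM_ for coding; NR_ and XR_ for noncoding) and treats everything else as "other". The function name is hypothetical, and the accessions in the usage example are made-up placeholders rather than real records. A "noncoding" result is only a prompt to inspect the record itself for the NMD-related "misc_feature" annotation.

```python
def refseq_designation(accession):
    """Map a RefSeq transcript accession to its coding/noncoding designation,
    based solely on the accession prefixes described in the text."""
    prefix = accession.strip()[:3].upper()
    if prefix == "NM_":
        return "coding"
    if prefix in ("NR_", "XR_"):
        # May still contain an ORF: check the record for an NMD annotation.
        return "noncoding"
    return "other"


if __name__ == "__main__":
    # Placeholder accessions for illustration only.
    for acc in ("NM_000000.0", "NR_000000.0", "XR_000000.0"):
        print(acc, refseq_designation(acc))
```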
For identification of 3UIs in human mRNAs, the Vega genome browser, which displays annotations from the Human and Vertebrate Analysis and Annotation (HAVANA) Group at the Wellcome Trust Sanger Institute 77, is currently the best place to start. Like RefSeq, the HAVANA group uses a manual curation process, and its annotations therefore have relatively good specificity. However, rather than designating them noncoding, HAVANA includes 3UI-containing transcripts in its coding sequence database, and simply flags them with the NMD biotype.
Introns in both 5′- and 3′-UTRs influence gene expression in ways that are different from introns in coding regions. Within the context of the 5′-UTR, presence or absence of an intron can dictate the mechanism of mRNA export. The export pathway used might depend on alternate promoter usage and could influence gene expression on several levels, including subcellular localization and translational control. In the 3′-UTR, introns can target the mRNA for degradation by NMD. While we have long appreciated the importance of NMD in quality control, only more recently have we begun to understand that NMD also regulates normal gene expression through functional 3UIs. Therefore, it is now time to change the default assumption that 3UI-containing transcripts are non-coding or nonfunctional. Like miRNA binding sites and 5′-UTR structure, introns in UTRs should be regarded as important cis-regulatory elements that modulate multiple levels of gene expression.