Leslie Turner’s research focuses on identifying and functionally characterizing genes contributing to speciation in rodents. Bettina Harr is using the house mouse as a model to study divergence and speciation at the molecular level.
Alternative splicing, the combination of different exons to produce a variety of transcripts from a single gene, contributes enormously to transcriptome diversity in mammals, and the majority of genes encode alternatively spliced products. Previous research comparing mouse, rat and human has shown that a significant proportion of splice forms are not conserved across species, suggesting that alternative transcripts are an important source of evolutionary novelty. Here, we studied the evolution of alternative splicing in the early stages of species divergence in the house mouse. We sequenced the testis transcriptomes of three Mus musculus subspecies and Mus spretus using Illumina technology. On the basis of a genome-wide analysis of read coverage differences among subspecies, we identified several hundred candidate alternatively spliced regions. We conservatively estimate that 6.5% of testis-expressed genes show alternative splice differences between at least one pair of M. musculus subspecies, a proportion slightly higher than the proportion of genes differentially expressed among subspecies. These results suggest that differences in both the structure and abundance of transcripts contribute to early transcriptome divergence.
A majority of genes have multiple splice forms (Sharov et al. 2005; Kim et al. 2007) and alternative splicing generates a large proportion of the protein diversity visible to natural selection. Novel splice forms can be considered ‘internal paralogues’ because variant-specific exons may be free to vary without affecting the function of the original splice variant (Modrek & Lee 2003). Like gene duplication, alternative splicing enables alterations in protein structure that are spatially or temporally restricted. Thus, alternative splicing could potentially play a major role in adaptation and speciation (Ast 2004; Xing & Lee 2006).
Comparative analyses of alternative splicing in mouse, rat and human have revealed intriguing patterns of transcriptome evolution (Modrek & Lee 2003; Nurtdinov et al. 2003, 2009; Gilad et al. 2005). There is a striking lack of conservation of alternatively spliced exons; only 28% of exons present in minor splice forms (<50% of transcripts) are conserved between mouse and human, in contrast with 98% of constitutively expressed and major form exons (Modrek & Lee 2003). Exons that have arisen since the mouse-rat and human lineages split have elevated ratios of nonsynonymous to synonymous divergence rates on average, suggesting novel exons are often subject to positive selection (Wang et al. 2005). Evolution of transcripts at the structural level appears to be a rapid and frequently adaptive process. However, comprehensive studies of alternative splicing evolution so far are based on comparisons of the few taxa with sequenced genomes and extensive EST databases. The recent development of deep sequencing techniques for transcriptomes, ‘RNA-Seq’ (Mortazavi et al. 2008), enables comparisons of splice variants in closely related species. Characterizing the early stages of evolution of novel transcripts will facilitate both identification of the precise molecular events causing new alternative splicing events and evaluation of the role of alternative splicing in adaptation and speciation.
Here, we investigate the evolution of splice variation in three recently diverged house mouse subspecies [Mus musculus musculus, M. m. domesticus, M. m. castaneus, diverged ∼500 000 years ago (Boursot et al. 1993)]. Previous studies identified a large set of genes with differential expression in testis among the three subspecies (Rottscheidt & Harr 2007; Voolstra et al. 2007). Detailed analysis of one of these differentially expressed genes, mitogen-activated protein kinase kinase 7 (Map2k7), revealed that differences in expression level were due to the presence of a splice variant in M. m. domesticus that was absent in M. m. musculus (Harr et al. 2006). This result prompted our interest in alternative splicing divergence as an important source of transcriptome variation among subspecies.
We performed Illumina sequencing of the testis transcriptome for the three house mouse subspecies and the closely related Algerian mouse (Mus spretus). We estimate the proportion of exons and genes with alternative splicing differences among subspecies and identify candidate alternatively spliced gene regions.
Materials and methods
Sampled animals include two unrelated males (8–10 weeks old) each from Mus musculus domesticus (Cologne/Bonn area, Germany, provided by M. Teschke and C. Pfeifle), M. m. musculus (Vienna, Austria, provided by K. Musolf) and M. m. castaneus (Taipei, Taiwan, provided by A. Yu) and one Mus spretus male (Spain). All individuals are lab-bred offspring of wild-caught parents.
Illumina sample preparation
We extracted RNA from flash-frozen testis tissue using TRIZOL (Invitrogen), and isolated mRNA from total RNA using the μMACS mRNA isolation kit (Miltenyi Biotec). The integrity of total RNA and mRNA was verified using the RNA 6000 Nano chip kit on an Agilent Technologies 2100 Bioanalyzer. We synthesized cDNA from mRNA using Superscript III reverse transcriptase (Invitrogen) and a 10:1 primer mixture of random dodecamers and oligo(dT)18. Library preparation and deep sequencing using a Genome Analyzer II (Illumina) were performed by the National Center for Genome Resources (Santa Fe, NM) following paired-end sequencing and mRNA sequencing protocols from Illumina. Samples were sequenced in either one or two flow cells. The data discussed in this publication have been deposited in the NCBI Gene Expression Omnibus and are accessible through GEO Series accession number GSE18905 (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE18905).
Mapping of reads. Our data set consists of 46 bp sequences, hereafter ‘reads,’ derived from mRNA fragments. For each sample we used the program Tophat Version 1.0.10 (Trapnell et al. 2009) (http://tophat.cbcb.umd.edu/) to perform spliced alignments of the reads against the house mouse reference genome (version NCBIM37.50, downloaded from ENSEMBL at ftp://ftp.ensembl.org/pub/current_fasta/mus_musculus/dna/). We excluded the small portion of the genome that is unmapped from our reference sequence file. We ran Tophat in the ‘paired-end mode’ using appropriate settings [i.e. a minimum distance between paired-end reads of 120 bp and a maximum distance of 500 000 bp (the value recommended in the Tophat manual for mammals)]. Tophat does not allow for indels during the mapping of reads. By default, no more than two mismatches between read and reference are allowed in the first (5′) 28 bp of the read. We only report results for reads that mapped in the expected paired-end fashion to a single unique location in the genome (i.e. Tophat’s max-gene-family option was set to 1). We used the default settings for all other Tophat options.
Exon assignment and gene expression measures. We assigned each read, on the basis of its ‘starting’ position (i.e. the first base on the forward reference strand where alignment occurs), to annotated exons and genes by querying the ENSEMBL database using the Perl API Installation available on the ENSEMBL website (http://www.ensembl.org/info/docs/api/api_installation.html) and custom Perl scripts. Exon-specific expression measures are sums of all reads mapping to a specific exon. Gene-specific expression measures are sums of all reads mapping to all exons that are assigned to a specific gene. For comparison across samples, read counts were normalized relative to the total number of mapped reads in a sample. We retrieved the largest number of mapped reads for the M. spretus sample. Thus, the correction factor was set to 1 for M. spretus. Corrected read counts for the other samples were then calculated as:
\[ \text{corrected read count}_{\text{sample}} = \text{read count}_{\text{sample}} \times \frac{\text{total mapped reads}_{\textit{spretus}}}{\text{total mapped reads}_{\text{sample}}} \]
where ‘sample’ is the sample under consideration. RNA-Seq gene expression levels need to be corrected for gene length to be comparable to expression levels measured on microarrays. From the ENSEMBL annotation, we estimated the length of each gene by taking the average of all known transcripts per gene. We normalized read counts to correct for gene length in the following way:
Corrected gene expression measures were compared to expression levels obtained from Affymetrix microarrays (GeneChip Mouse Genome 430 2.0 array) performed on the domesticus and musculus individuals only (Harr et al. unpublished results). We included all genes for which there was at least one RNA-Seq read in one of the domesticus or musculus samples and for which an Affymetrix probe set was available. For genes targeted by more than one Affymetrix probe set, we randomly selected a single probe set to represent that gene. Altogether, 14 397 genes were analysed for concordance in gene expression across platforms.
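The two normalization steps described above can be sketched in Python. This is a minimal illustration, not the custom Perl scripts used in the study. The library-size step scales every sample to the largest library, so that the largest library (M. spretus here) gets a factor of 1. The gene-length step assumes a simple reads-per-kilobase scaling over the mean transcript length; the exact formula is not given in the text, so that scaling is an assumption. All function names are hypothetical, and the two library sizes are taken from Table 1.

```python
def library_size_factor(mapped_reads_sample, mapped_reads_largest):
    """Scaling factor for one sample; equals 1.0 for the largest library."""
    return mapped_reads_largest / mapped_reads_sample


def corrected_count(raw_count, mapped_reads_sample, mapped_reads_largest):
    """Read count scaled up to the depth of the largest library."""
    return raw_count * library_size_factor(mapped_reads_sample, mapped_reads_largest)


def mean_transcript_length(transcript_lengths):
    """Gene length estimated as the average over all known transcripts."""
    return sum(transcript_lengths) / len(transcript_lengths)


def length_normalized(count, transcript_lengths):
    """ASSUMPTION: reads per kilobase of mean transcript length."""
    return count / (mean_transcript_length(transcript_lengths) / 1000.0)


largest = 4_771_692   # largest number of mapped reads in Table 1
sample = 2_955_710    # another library from Table 1

scaled = corrected_count(100, sample, largest)      # 100 raw reads in the smaller library
per_kb = length_normalized(scaled, [2000, 3000])    # hypothetical gene with two transcripts
```

Scaling to the largest library rather than to a fixed constant keeps counts on the scale of the deepest sample, which is convenient when a minimum read threshold (such as five reads) is applied afterwards.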
Detection of splicing differences across subspecies. We pursued two different strategies to identify regions with alternative splicing differences among subspecies. The first strategy is based on the known ENSEMBL exon annotation (termed ‘exon analysis’ in the following) and the second strategy is annotation independent (termed ‘window analysis’).
To detect transcript structure differences involving annotated exons, we assigned uniquely mapping paired-end reads to exons as described above. Read counts were corrected for the differences in the total number of mapped reads among subspecies (see above). We removed all exons that did not have at least five reads in at least one M. musculus individual. To identify significant differences in read counts among subspecies for each exon, we performed one-way ANOVA in the statistical software package R (http://www.r-project.org/) using corrected read count as dependent variable and the two samples for each of the three subspecies as replicates (hence three error degrees of freedom). The significance level after correction for multiple testing using a 10% false discovery rate (FDR) was 0.0001 [calculated using the program fdrtool (Strimmer 2008), http://strimmerlab.org/software/fdrtool/html/fdrtool.html].
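The per-exon test can be illustrated with a small pure-Python sketch. It is not the R code used in the study. It relies on the fact that with exactly three groups the numerator has two degrees of freedom, in which case the F distribution has the closed-form tail probability P(F > f) = (1 + 2f/d2)^(-d2/2), where d2 is the error degrees of freedom (three for two replicates per subspecies). The function name and the read counts are hypothetical, and the FDR correction performed with fdrtool is not reproduced.

```python
def anova_three_groups(groups):
    """One-way ANOVA for exactly three groups; returns (F, p)."""
    if len(groups) != 3:
        raise ValueError("closed-form p-value below is valid only for three groups")
    values = [v for g in groups for v in g]
    n = len(values)
    grand_mean = sum(values) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_within = n - 3                      # 6 samples - 3 subspecies = 3
    if ss_within == 0:
        # Degenerate case: no within-group variation at all.
        return (float("inf"), 0.0) if ss_between > 0 else (float("nan"), 1.0)
    f_stat = (ss_between / 2) / (ss_within / df_within)
    # Tail probability of F(2, df_within), exact for two numerator df.
    p_value = (1 + 2 * f_stat / df_within) ** (-df_within / 2)
    return f_stat, p_value


# Hypothetical corrected read counts for one exon: (DOM, MUS, CAS), two individuals each.
f_stat, p_value = anova_three_groups([[50, 54], [10, 12], [11, 13]])
```

With only three error degrees of freedom, even a large between-subspecies difference yields a modest P-value, which is why very few exons reach the FDR-corrected threshold.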
We classified genes as significantly differentially spliced among subspecies if they contained at least one significant exon, and at least three times as many nonsignificant exons as significant ones. Genes with a high proportion of significant exons are probably differentially expressed among subspecies, rather than differentially spliced, and these genes are not the focus of this study. Given these criteria, we can detect alternatively spliced regions only for genes that have at least four expressed exons. The prevalence of transcript structure differences across subspecies was thus determined with respect to the total number of genes with ≥5 reads in any subspecies for at least four exons.
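The classification rule in this paragraph can be written as a short function. This is a sketch under the assumption that the input already contains only expressed exons (those passing the five-read filter); the function name and labels are hypothetical.

```python
def classify_gene(exon_is_significant, min_ratio=3):
    """Apply the 3:1 nonsignificant-to-significant exon rule to one gene.

    exon_is_significant: one boolean per expressed exon of the gene.
    """
    n_sig = sum(1 for flag in exon_is_significant if flag)
    n_nonsig = len(exon_is_significant) - n_sig
    if n_sig == 0:
        return "no difference detected"
    if n_nonsig >= min_ratio * n_sig:
        return "differentially spliced"
    # Many significant exons relative to nonsignificant ones.
    return "likely differentially expressed"


one_of_four = classify_gene([True, False, False, False])
two_of_five = classify_gene([True, True, False, False, False])
```

A gene with one significant exon needs three nonsignificant exons to pass, which is why only genes with at least four expressed exons can be evaluated.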
We also identified transcript structure differences between subspecies independent of any annotation. To do this, we divided the total length of each chromosome into 500 bp nonoverlapping windows. Each read was assigned to a single window, on the basis of its starting position (see above). We performed read assignment for each M. musculus subspecies and M. spretus separately. Windows that did not contain any reads in any sample were excluded from further analysis. The number of reads within a window was standardized for differences in total number of mapped reads between subspecies as described above. To identify potential transcript variation among subspecies, we looked for candidate regions that may contain ‘exons’ transcribed in some subspecies but not others. First, we identified windows with significant differences in read coverage among subspecies (not including spretus) using one-way ANOVA as detailed above. The significance level after correction for multiple testing using a 10% FDR was 0.00057. Precise estimation of expression levels in a specific region (i.e. exon) on the basis of read counts requires very high sequencing coverage, higher than the coverage achieved for most genes in our data set. There is more variation in read counts among individuals within subspecies for low coverage windows, thus it is rare to achieve the 10% FDR level even when differences among subspecies are large. Therefore, we applied a second set of criteria to identify differentially expressed windows, using a less stringent P-value (P <0.05) but requiring a large difference in coverage (>5-fold) among subspecies. All three subspecies are included in the ANOVA, thus the differentially expressed windows include those that were significant at the P <0.05 level and showed at least fivefold difference in coverage (averaged across the two samples within subspecies) in at least one of the three possible subspecies comparisons (i.e. domesticus vs. musculus, domesticus vs. castaneus, musculus vs. castaneus).
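The window assignment and the second significance criterion can be sketched as follows. The coordinate convention (1-based positions, window 0 covering bases 1 to 500), the handling of a subspecies with zero mean coverage (treated as an infinite fold difference) and all names are assumptions made for illustration; the P-value is taken as given from the ANOVA.

```python
from collections import Counter

WINDOW_SIZE = 500  # bp, nonoverlapping


def window_index(read_start, window_size=WINDOW_SIZE):
    """Window containing a read's starting position (1-based coordinates assumed)."""
    return (read_start - 1) // window_size


def count_reads_per_window(read_starts):
    """Number of reads assigned to each window on one chromosome."""
    return Counter(window_index(s) for s in read_starts)


def max_fold_difference(subspecies_means):
    """Largest pairwise ratio among subspecies mean coverages (max / min)."""
    highest = max(subspecies_means.values())
    lowest = min(subspecies_means.values())
    if highest == 0:
        return 1.0                  # no coverage anywhere; such windows are excluded upstream
    if lowest == 0:
        return float("inf")         # ASSUMPTION: absence in one subspecies counts as a large difference
    return highest / lowest


def passes_second_criterion(p_value, subspecies_means, p_cut=0.05, fold_cut=5.0):
    """P < 0.05 and more than fivefold difference in at least one pairwise comparison."""
    return p_value < p_cut and max_fold_difference(subspecies_means) > fold_cut


counts = count_reads_per_window([1, 499, 500, 501, 1200])
six_fold = passes_second_criterion(0.01, {"DOM": 60.0, "MUS": 10.0, "CAS": 30.0})
```

Because the largest pairwise ratio among three means is always the maximum divided by the minimum, the three pairwise comparisons do not need to be enumerated explicitly.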
Our goal was to detect alternatively spliced regions contributing to differences in transcript structure among subspecies, rather than simply identify transcripts that have quantitative differences in expression among subspecies. Thus, we employed a strategy that considers the significance of all windows in a 10-kb region. The strategy is illustrated in Fig. 1 for a hypothetical gene region. When we detect a significant window, we scan neighbouring windows within 5 kb of the significant window on either side. Flanking windows can show coverage (at least five reads) that is not significantly different between subspecies (‘nonsignificant windows’), they can show coverage that is significantly different between subspecies (‘significant windows’) or they can show no coverage. For each significant window, we determined the ratio of nonsignificant to significant windows in the surrounding 10 kb region (‘significance ratio’). If this ratio is high (Fig. 1a), we considered this event a likely splicing difference between subspecies. If the significance ratio is small (Fig. 1b), we considered this window more likely to be part of a transcript with quantitative differences in expression. If the significance ratio is 0, we did not count the event (Fig. 1c).
We chose a 10-kb region for determining the significance ratio on the basis of typical gene structure characteristics in the mouse genome; median gene size in mouse is 13 542 bp (ENSEMBL), median exon size is 129 bp (ENSEMBL) and median intron size is 1093 bp (Modrek & Lee 2003). Thus, a 10-kb region is large enough to include several exons within the same gene and provide a meaningful significance ratio, but small enough that it is unlikely to include exons from adjacent genes; maximum distance from the target window considered is 5 kb and median intergenic distance is 31 kb (ENSEMBL). Given variation in gene structure, however, any criteria chosen will yield some false positives as well as false negatives. Nevertheless, this screen approach will provide an estimate of the prevalence of alternative splicing differences among subspecies as well as identify candidate alternatively spliced regions.
In analogy to the exon analysis, the window analysis requires at least three times as many nonsignificant as significant windows in a 10-kb region surrounding the central window to be classified as alternatively spliced. Thus, the proportion of transcript structure differences among subspecies was calculated with respect to the total number of expressed windows that were flanked by at least three additional expressed windows within 10 kb; that is, all windows from regions where expression levels were sufficient such that we had power to detect alternative splicing.
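The significance-ratio scan can be sketched as below. Two points are assumptions rather than details stated in the text: the central significant window is itself counted among the significant windows, and ‘within 5 kb on either side’ is implemented as ten 500 bp windows on each side. Function names and labels are hypothetical.

```python
FLANK_WINDOWS = 10  # 5 kb on either side / 500 bp per window


def classify_window(status, center, flank=FLANK_WINDOWS, min_ratio=3.0):
    """Classify one significant window by the surrounding significance ratio.

    status: dict mapping window index -> "sig" or "nonsig";
            windows without sufficient coverage are simply absent.
    """
    if status.get(center) != "sig":
        raise ValueError("center must be a significant window")
    n_sig = n_nonsig = 0
    for i in range(center - flank, center + flank + 1):
        state = status.get(i)
        if state == "sig":
            n_sig += 1              # includes the central window (assumption)
        elif state == "nonsig":
            n_nonsig += 1
    ratio = n_nonsig / n_sig
    if ratio == 0:
        return ratio, "not counted"
    if ratio >= min_ratio:
        return ratio, "candidate splicing difference"
    return ratio, "likely quantitative expression difference"


isolated = {100: "sig", 98: "nonsig", 101: "nonsig", 103: "nonsig", 105: "nonsig"}
clustered = {100: "sig", 101: "sig", 102: "sig", 103: "nonsig"}
ratio_a, call_a = classify_window(isolated, 100)
ratio_b, call_b = classify_window(clustered, 100)
```

Under these assumptions a single significant window needs at least three expressed, nonsignificant neighbours to be called, matching the requirement of at least three additional expressed windows within 10 kb.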
We annotated the list of windows with sufficient expression using the ENSEMBL database and the Perl API Installation. Specifically, for each 500 bp window we recorded the number of bases of coding (including UTR sequence) and noncoding sequences (intergenic and intronic sequence). We assign a window to a known gene if any base covered by the significant window is annotated as ‘coding’ or ‘intronic’ in the reference mouse genome.
Analysis of splice junctions. To further assess transcript variation, we looked for splice junctions that vary among subspecies. Tophat identifies splice junctions by first mapping all reads to the genomic reference sequence and setting aside reads that do not match. In the following step, the program identifies ‘islands’ of matching reads, representing expressed regions (i.e. putative exons) and assesses whether previously unmatched reads span any of these islands. Tophat calls a splice junction if at least one read is found that spans two expressed islands (hereafter ‘spliced reads’). By default Tophat finds any reads that span splice junctions by at least five bases on each side. Tophat also considers paired-end read information when available to support detected splice junctions. We considered a splice junction expressed in a given subspecies if it was covered by ≥5 spliced reads. For each junction expressed in at least one M. musculus sample, we compared read numbers for all samples, again adjusted for differences in overall sample coverage, using one-way ANOVA. As before, we identified splice junctions differing among subspecies using two different significance criteria. As none of the splice junctions satisfied the criteria of the 10% FDR cutoff (P <5 × 10⁻⁶), we report values for P < 0.001 as a relatively stringent significance threshold. In addition, we selected all splice junctions with P <0.05 and >5-fold coverage difference among subspecies. Significant splice junctions were assigned to transcripts using the ENSEMBL database and classified as ‘novel’ if either the 3′- or 5′-splice site was not annotated.
A test case. Previous studies of the Map2k7 gene (Harr et al. 2006) revealed a large difference in transcript abundance in the testis between two house mouse subspecies, domesticus and musculus. A highly expressed ∼1.6 kb transcript was present in domesticus but absent in musculus, suggesting alternative transcript usage evolved between these subspecies. We used the known difference between subspecies in Map2k7 transcripts as a test case for our detection algorithm. To visually display Illumina read coverage in the Map2k7 region we used the UCSC custom browser option (http://genome.ucsc.edu/cgi-bin/hgCustom) and plotted read coverage in the Map2k7 region along the genomic DNA.
We performed Illumina sequencing of testis RNA from three subspecies of Mus musculus (two biological samples each) and Mus spretus (one sample). For each sample, 6–10 million high quality reads, each 46 bp in length, were generated (Table 1). Approximately 50% of reads mapped to unique locations in the reference house mouse genome, in the expected ‘paired-end’ orientation, with no more than two mismatched bases. About half of the mapped reads were perfect matches, ∼30% mapped with 1 bp mismatch and the remaining ∼20% mapped with two mismatches to the reference genome (Table 1 and Fig. S1, Supporting Information). Mapping success and mismatch proportions were similar for all samples in the study and did not reflect phylogenetic distance to the reference genome, which is predominantly M. m. domesticus in origin (Yang et al. 2007). For example, mismatch proportions are not elevated in M. spretus, the outgroup to all M. musculus subspecies, nor are they lowest in M. m. domesticus samples. Thus, the mapping approach employed works well for closely related taxa at various phylogenetic distances from the reference genome and sequence divergence is unlikely to have a significant effect on estimates of read coverage.
Table 1. Illumina sequencing data obtained from Mus testis transcriptomes. Samples include two individuals each of Mus musculus castaneus (CAS), M. m. domesticus (DOM) and M. m. musculus (MUS) and one Mus spretus (SPR) individual. ‘No. mapped reads’ excludes repetitive reads and all hits mapping to multigene families
Total no. reads    No. mapped reads
6 067 213          2 955 710
6 247 866          3 271 464
8 245 978          3 443 840
8 834 374          4 295 660
7 199 979          3 660 096
5 832 861          2 985 568
10 127 623         4 771 692
(Remaining column headings, values not preserved: % mapped reads; % perfect matches; % 1 mismatch; % 2 mismatch; No. spliced reads; % spliced reads)
The largest share of reads (40%, Fig. S2, Supporting Information) mapped to annotated coding regions, followed by introns (18%), 3′-UTR (16%) and intergenic regions (15%). A smaller proportion of reads mapped to 5′-UTRs and noncoding (i.e. RNA) genes. The large number of reads mapping to intronic and intergenic regions implies that many transcriptionally active regions are not currently annotated. About 7% of all mapped reads represented ‘spliced reads’ (Table 1).
We determined the distribution of reads across transcripts to assess whether our sequencing method produced even coverage (Fig. S3, Supporting Information). Reads for 3′ exons were highly overrepresented, while coverage was roughly equal across the rest of transcripts. This pattern suggests that the oligo(dT)18 primers were much more efficient during cDNA synthesis than random dodecamers. We suggest that using random primers alone, or at a greater excess than the 10:1 mixture used here, would produce more even transcript coverage.
Of the 28 517 annotated genes in ENSEMBL (including 3287 tRNA, rRNA or snRNA genes), 17 000–18 000 are represented by at least one read in each individual (Table 2). At a higher threshold of 10 reads, we could detect ∼13 000 genes per individual.
Table 2. Number of genes detected at different total read coverage levels in samples of Mus musculus and Mus spretus (sample abbreviations as in Table 1)
To determine how the RNA-Seq results compare to previous gene expression data for these subspecies, we compared RNA-Seq gene expression levels to Affymetrix microarray data for two of the three subspecies (domesticus and musculus) for which we have data from both platforms for the same individuals. Including all genes that matched the criteria as described in Materials and methods, the Spearman rank correlation coefficient was r∼0.4 (0.38 for musculus and 0.36 for domesticus) and highly significant (P <0.001). However, we noticed that genes that show very high read counts in our RNA-Seq data set show relatively lower intensity values on the microarray. Excluding the 20 genes with the highest read counts increased the correlation coefficient dramatically (r∼0.6, Fig. 2). Most likely, this behaviour occurs because at high expression levels signal intensities on microarrays saturate and thus are not proportional to expression level. Apart from this effect, we found that the two platforms produce similar estimates of gene expression level.
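The effect of a few saturated genes on a rank correlation can be illustrated with a small pure-Python Spearman calculation. The read counts and array intensities below are invented toy values chosen to mimic saturation of the most highly expressed gene; they are not data from the study, and the function names are hypothetical.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with ties given the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes both inputs have nonzero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def drop_top_by_reads(read_counts, intensities, n_drop):
    """Remove the n_drop genes with the highest read counts from both vectors."""
    keep = sorted(range(len(read_counts)), key=lambda i: read_counts[i])
    keep = keep[: len(read_counts) - n_drop]
    return [read_counts[i] for i in keep], [intensities[i] for i in keep]


reads = [5, 20, 80, 300, 100000]          # toy RNA-Seq counts
intensity = [6.0, 7.5, 9.0, 11.0, 8.0]    # toy array signal; top gene reads too low

rho_all = spearman(reads, intensity)
rho_trimmed = spearman(*drop_top_by_reads(reads, intensity, 1))
```

In this toy example a single gene whose array signal falls below its RNA-Seq rank lowers the coefficient noticeably, and removing it restores a perfect rank agreement.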
Variation in transcript structure among subspecies
The aim of the study was not to detect differential expression at the whole-gene level, but to detect differences in transcript structure due to splicing – that is, differential expression limited to a small region within a gene. We pursued two approaches to identify differences in transcript structure between subspecies, one that is based on annotated exons in the mouse genome (‘exon analysis’) and one that is annotation independent and enables identification of differences in both annotated and novel exons (‘window analysis’). For each analysis, we identify alternatively transcribed regions using two different significance criteria, the first is a stringent P-value that corresponds to a 10% FDR and the second is a P-value <0.05 plus >5-fold difference in expression of alternative transcript structures.
For the exon analysis, read coverage was sufficient to test for alternative splicing of 80 158 exons, representing 8457 genes. Employing the more stringent cutoff (<10% FDR, P <0.0001), we identified 48 (0.06%) alternatively transcribed exons from 46 (0.5%) genes (Table 3). Using the second significance criterion, we identified 1896 (2.4%) exons, affecting 1491 (17.6%) genes. The vast majority of these genes (1191, 80%) have a single alternatively transcribed exon, out of an average 12 exons/gene [by comparison, transcripts from a single gene within a species differ on average by 2.4 exons, median 1 exon (determined for transcripts in ENSEMBL)]. However, we also found a substantial fraction of genes with two (231, 15%) or three variable exons (51, 3.4%). Only 20 (1.3%) genes showed differences in transcript structure affecting more than three exons.
Table 3. Candidate alternatively spliced regions among Mus musculus subspecies. The number of intervals and number of genes or intergenic regions represented are reported at two significance levels for each of three analyses (exon analysis, window analysis and splice junction analysis; see Materials and methods). The second significance level is P <0.05, FC >5
*Exons with ≥5 reads in at least one subspecies, and from genes with at least three other expressed exons.
†Genes with ≥5 reads in at least one subspecies for at least four exons.
‡Windows with ≥5 reads in at least one subspecies and which showed at least three additional expressed windows within 10 kb.
§Genes with ≥5 reads in at least one subspecies for at least four windows.
¶Splice junctions with ≥5 reads in at least one subspecies.
For the window analysis, we systematically assessed read coverage over the whole genome in 500 bp windows and employed the ratio of nonsignificant:significant windows (‘significance ratio’) within 5 kb on either side of a significant window (Fig. 1) to identify alternatively spliced regions. We had power to test a total of 141 889 windows for alternative splicing, i.e. windows flanked by at least three other expressed windows. These windows represented 12 434 annotated genes (as well as intergenic regions). Using the first significance criterion (<10% FDR, P <0.00057) we found 480 (0.3%) regions with a significance ratio ≥3:1 (Table 3). These events affected 279 (2.2%) genes and 190 intergenic regions. Using the second significance criterion and a significance ratio ≥3:1, we identified 4050 (2.8%) windows affecting 2388 (19.2%) genes and 1132 intergenic regions.
Analysis of splice junctions
In addition to analyzing differential expression of regions within genes, we directly examined splice junctions. In total, we identified 115 593 unique splice junctions represented by at least one read in at least one subspecies, and 44 974 junctions had sufficient coverage (≥5 reads) to be assessed for variation among Mus subspecies. The vast majority (>98%) of splice junctions are already annotated in ENSEMBL; fewer than 2% represent novel junctions. Although there were no significant differences among subspecies in splice junctions using a 10% FDR cutoff, 186 (0.4%) splice junctions are significant at P <0.001 and 1056 (2.3%) are significant at P <0.05 and show a more than fivefold difference in read coverage (Table 3). The reason for the absence of significant splice junctions under the 10% FDR criterion could be that read coverage across splice junctions shows especially large within-group variation, which makes the test for between-subspecies differences conservative. This is expected because the interval where read coverage is compared is much smaller than for the exon and window analysis; start positions for reads mapped to a specific splice junction can only differ by a maximum of 36 bp (such that ≥5 bp are mapped to each side of the junction), whereas they can vary up to ∼500 bp for the window analysis and up to the entire exon length for the exon analysis.
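The 36 bp figure follows directly from the read length and the minimum overhang, as the short calculation below shows. The function name is hypothetical; the read length (46 bp) and minimum anchor (5 bp on each side) are the values given in the text.

```python
def junction_left_overhangs(read_length=46, min_anchor=5):
    """Possible numbers of bases a spliced read can place left of the junction."""
    return list(range(min_anchor, read_length - min_anchor + 1))


overhangs = junction_left_overhangs()
n_start_positions = len(overhangs)             # distinct start positions per junction
max_spread = overhangs[-1] - overhangs[0]      # largest difference between two starts
```

A 46 bp read with at least 5 bp on each side can have between 5 and 41 bases on the left of the junction, giving 37 possible start positions that differ by at most 36 bp, compared with up to 500 start positions in a window.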
A comparison of genes with evidence for alternative splicing identified by the three analyses (Fig. 3) shows that the overlap among methods is substantial, but not complete. The exon analysis had the largest proportion of regions validated by other approaches (48%), followed by the splice junction analysis (46%) and the window analysis (32%). This is the expected pattern, because annotated exons represent a subset of the regions analysed in the window analysis. Overall, 13 466 genes had sufficient read coverage to be evaluated for alternative splicing by at least one method (i.e. ≥5 reads for ≥4 exons, ≥5 reads for ≥4 windows and/or ≥5 reads for a splice junction). A total of 3730 (28%) of these genes have evidence for alternative splicing from at least one method, 873 (6.5%) genes from two methods and 123 (0.9%) genes from all three methods (Table S1, Supporting Information).
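The tally of genes supported by one, two or three methods amounts to counting set memberships. The sketch below uses invented gene identifiers and reports cumulative counts (‘at least k methods’); whether the published counts are cumulative or exclusive is not stated, so that choice is an assumption, as are the function names.

```python
from collections import Counter


def support_counts(exon_genes, window_genes, junction_genes):
    """Number of methods (1 to 3) supporting each gene."""
    counts = Counter()
    for gene_set in (exon_genes, window_genes, junction_genes):
        counts.update(set(gene_set))   # each method contributes at most once per gene
    return counts


def tally_at_least(counts):
    """Genes supported by at least 1, at least 2 and all 3 methods."""
    return {k: sum(1 for c in counts.values() if c >= k) for k in (1, 2, 3)}


# Hypothetical gene identifiers, for illustration only.
exon_hits = {"g1", "g2", "g3"}
window_hits = {"g2", "g3", "g4"}
junction_hits = {"g3", "g5"}

tally = tally_at_least(support_counts(exon_hits, window_hits, junction_hits))
```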
A test case, Map2k7
The Map2k7 gene shows a dramatic difference in transcript usage in the testis of wild domesticus and musculus (Harr et al. 2006). Northern blots showed that there is a highly expressed ∼1.6-kb transcript in domesticus and a ∼4-kb transcript present at much lower levels in both domesticus and musculus. The precise exon structure of the short transcript is unknown and no transcript of that size is annotated in ENSEMBL.
Here, we aimed first to confirm that our analysis methods using Illumina read coverage detect the difference in Map2k7 splice forms, and second to identify the exon structure of the short transcript. We scanned the Map2k7 region and 10 kb of upstream and downstream flanking sequence and identified three 500 bp intervals with significant differences in read coverage among subspecies (Table 4). The 10-kb regions containing these three windows show a high significance ratio. The exon analysis identified five significant exons in the Map2k7 region, all of which co-localize with two of the intervals identified using the window method. Thus, both the exon and window analyses successfully detected the splice variation in Map2k7.
Table 4. Candidate alternatively spliced intervals in the Map2k7 region on chromosome 8 identified using two analysis methods. ‘Position’ is the first position of the significant window (window analysis) or the first position of the annotated exon (exon analysis). For the exon analysis, the length of annotated exons with identical starting positions is reported. Multiple exons with identical starting positions and read counts are reported once. The numbers in the table represent the number of reads mapping to that interval (corrected for differences in overall read number) for each individual (sample abbreviations as in Table 1). ‘Significance ratios’ are the ratios of nonsignificant: significant windows in a 10-kb region surrounding the window of interest (window analysis) or in the entire gene (exon analysis). Significance values are for comparisons of Mus musculus subspecies only
Column headings: Position (Chr. 8); Significance ratio (nonsign:sign)
Exon analysis positions: 4245695 (146 bp); 4245695 (412 bp)
The top part of Fig. 4 shows read coverage graphically for each of the M. musculus subspecies samples and M. spretus. The first two significant regions from the window analysis and the significant exons from the exon analysis localize precisely to UTR regions of various annotated transcripts in ENSEMBL. In these windows, domesticus has high coverage and musculus low coverage. The third significant window from the window analysis maps to an annotated exon of two different transcripts, but shows relatively low coverage in all subspecies, despite being significantly higher in domesticus compared to the other Mus subspecies.
The significant regions identified by window and exon analyses do not co-occur in any single transcript annotated in ENSEMBL (Fig. 4). However, one mRNA sequence available in GenBank contains the first and second significant windows and the significant exons from the exon analysis. This transcript (U93030) has been characterized as testis specific in laboratory strains of mice (Tournier et al. 1997) and has the same length (∼1.6 kb) as the transcript previously detected in wild domesticus (Harr et al. 2006). Laboratory strains of house mice are known to be primarily of domesticus origin (Yang et al. 2007). Thus, it is highly likely that the transcript that we identified in wild domesticus is the same as U93030.
Like M. m. domesticus, the outgroup species M. spretus shows high coverage in the first two significantly differentially expressed regions, suggesting that presence of the short transcript is ancestral and that this transcript has been downregulated in musculus and castaneus. The third significant region identified in the window analysis is not associated with transcript U93030. This region was not significant in the exon analysis, suggesting it may be a false positive or is included in an alternate splice form expressed at much lower levels.
To our knowledge, this is the first comparative analysis of alternative splicing on the basis of next generation sequencing data. Taking advantage of this new technology enabled us to investigate the early stages of transcriptome divergence by comparing sequences from closely related mammals – which was only possible previously for human and chimpanzee (Calarco et al. 2007), because genome sequences for both species are available. We identified a substantial number of genes with evidence for divergence in transcript structure among house mouse subspecies on the basis of analyses of read coverage for annotated exons, unannotated transcribed regions and splice junctions. These results suggest that alternative splicing contributes substantially to transcript variation among recently diverged taxa.
Advantages and challenges of NGS approaches
Until recently, analysis of alternative splicing was possible only by sequencing large EST libraries. Recent development of exon microarrays allowed for comparison of splicing patterns among individuals for a few species with complete genome sequences (Johnson et al. 2003; Clark et al. 2007; Kwan et al. 2008), but unbiased interspecific comparisons using commercially available exon microarrays are not possible. Microarray probes are designed using transcript sequences from one species, and thus sample only exons transcribed in that species. Furthermore, sequence variation among taxa can contribute to hybridization differences (Gilad et al. 2005). If complete genome sequences for both taxa are known, probes with mismatches can be excluded (Calarco et al. 2007) or custom arrays with sequences from both species can be produced. For the vast majority of animal taxa where these requirements are not met, exon arrays are not an appropriate method to detect splice variation. By contrast, next generation sequencing is an unbiased method for comparing transcript sequences, because no a priori selection of regions likely to show variation is required. Thus, it can be applied to nonmodel organisms or samples from natural populations and is not limited to analysis of annotated exons. At present, some bias persists in the analysis of short read NGS data, because reads must be mapped to sequenced genomes generated from one or a few individuals. In the present study, sequence divergence probably did not have a large impact on results; proportions of reads that matched the genome (mostly Mus musculus domesticus origin) perfectly and with one or two mismatches were similar in all taxa (Table 1, Fig. S1, Supporting Information). As NGS read lengths increase, any bias will be alleviated as de novo assembly of each individual transcriptome prior to comparative analysis will be feasible.
One limitation of alternative splicing analysis using microarrays or next generation sequencing approaches is that reconstruction of entire transcripts is not yet possible. For example, for our test case Map2k7, we were only able to identify the specific transcript that differed among subspecies because the window-specific differences in read coverage matched only one of the transcript sequences available in GenBank. Even with deep sequence coverage, reconstruction of individual transcripts from short reads alone might be impossible, as several alternative transcripts can be transcribed at the same time in a single tissue within an individual. Longer read-lengths, explicitly including paired-end sequence information to generate transcript models, and read depths sufficient to cover all splice junctions from every alternative transcript within an individual will resolve these problems in the future (Mortazavi et al. 2008). Advanced statistical methods for isoform deconvolution from RNA-Seq and microarray data are also beginning to emerge (Hiller et al. 2009).
Another challenge is distinguishing differences between species due to alternative splicing from quantitative differences in expression level of the same splice form. Sequence depth required for precise estimation of expression level of each exon is much higher than is necessary for determining expression levels for whole genes, equivalent to microarray data (Marioni et al. 2008). Developing criteria that can reliably identify splice differences on the basis of limited sequence coverage is necessary so that low-abundance transcripts are not excluded and cost-effective studies with reasonable sample sizes are feasible. We used a gene-wide (exon analysis) or local (window analysis) ratio of nonsignificant:significant regions to identify candidate alternatively spliced exons. We chose the region size and significance ratios on the basis of typical gene structure characteristics in the mouse genome, but the specific values are arbitrary; thus, our results likely include some false positives and exclude some true differences. The concordance in genes identified by the three analyses performed (Fig. 3), which each used somewhat different significance criteria, suggests a substantial proportion of these candidate regions are true splice differences. Furthermore, we successfully identified splice variation in a test case, Map2k7, for which there is a known splicing difference among Mus subspecies (Harr et al. 2006). Both the exon and annotation-independent analyses identified multiple alternatively spliced regions of Map2k7 (Table 4) at the P < 0.05, FC > 5 significance threshold. The Map2k7 transcript difference was not detected using a 10% FDR significance threshold, suggesting this cutoff might be too conservative. This study was designed as a screen to assess the prevalence of alternative splicing among subspecies and identify candidate alternatively spliced regions for further study.
The 123 genes identified by all three analyses are particularly strong candidates (Table S1, Supporting Information). Evaluation of candidate regions by qRT-PCR and subsequent comparison of gene structures of validated regions vs. false positives, may provide useful information to refine the significance criteria employed in future NGS studies.
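The local ratio criterion described above for the window analysis can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the `Window` type, the function names and the `min_ratio` cutoff are hypothetical, the 10-kb region is assumed to be centred on the focal window, and the focal window is assumed to be counted in the ratio. The cutoff values actually used in the study are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    start: int         # first position of the window on the chromosome
    significant: bool  # True if coverage differs significantly among subspecies


def local_significance_ratio(windows, focal, region_bp=10_000):
    """Count (nonsignificant, significant) windows within region_bp centred on focal."""
    half = region_bp // 2
    nearby = [w for w in windows if abs(w.start - focal.start) <= half]
    n_sig = sum(w.significant for w in nearby)
    return len(nearby) - n_sig, n_sig


def is_candidate(windows, focal, min_ratio=3.0, region_bp=10_000):
    """Flag a significant window whose neighbours are mostly nonsignificant.

    min_ratio is a hypothetical cutoff chosen only for this example.
    """
    if not focal.significant:
        return False
    nonsig, sig = local_significance_ratio(windows, focal, region_bp)
    # sig >= 1 here because the focal window itself is significant and counted.
    return nonsig / sig >= min_ratio


# One significant window among nonsignificant neighbours: candidate splice difference.
isolated = [Window(s, s == 4000) for s in range(0, 10_000, 1000)]
# Every window significant: more consistent with a whole-gene expression difference.
uniform = [Window(s, True) for s in range(0, 10_000, 1000)]
```

The design intent is the one stated in the text: an isolated significant interval in an otherwise unchanged region points to a difference in transcript structure, whereas uniformly significant intervals point to a difference in expression level of the same splice form.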
Some differences in transcript structure identified here likely reflect splicing errors rather than functional alternative splice forms. The annotation-independent analysis may be particularly prone to false positives due to splicing errors because the large proportion of significant windows that are not annotated (30–40%) have no evidence for function. On the other hand, the available annotation is biased towards transcripts from only one subspecies, M. m. domesticus. We included two individuals for each subspecies, which should reduce the number of splicing errors classified as true splicing differences, but validation by other methods or in additional individuals is particularly important for these putative novel exons.
Alternative splicing evolution in closely related taxa
Given the challenges and uncertainties of the NGS approach employed, our estimates of the proportion of genes with alternative splicing differences among M. musculus subspecies are preliminary, but provide a first indication of the prevalence of evolutionary change in transcript structure at this taxonomic level. Depending on the analysis method and significance threshold applied, 0.06–2.8% of exons/windows and 0–2.3% of splice junctions show evidence for alternative splicing. Overall, 28% of the genes for which we have power to detect transcript structure changes across subspecies indeed show evidence for such events from at least one analysis. The 6.5% of genes identified by more than one analysis have stronger evidence for alternative splicing; thus, we propose that considering just these genes provides a conservative estimate of splice form variation among subspecies. The exon and window analyses identified very similar proportions of alternatively spliced genes using the lower stringency significance criteria (17.6% and 19.2% respectively). Increased sequencing depth may reveal that these higher values are closer to the true proportion, since it is likely that we missed some transcript differences in genes with lower expression.
By comparison, ∼4% of testis-expressed genes are differentially expressed among these subspecies (P <0.01, fold-change >2.5, Rottscheidt & Harr 2007). These results suggest that the structure and abundance of transcripts evolve at similar rates, and both contribute substantially to transcriptome variation among these recently diverged taxa.
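The conservative estimate discussed above counts only genes supported by more than one analysis. A minimal sketch of that tally is shown here, assuming each analysis yields a set of gene identifiers. The function name, the analysis labels and the gene names are hypothetical placeholders, not data from the study.

```python
from collections import Counter


def genes_supported_by(analyses, min_support=2):
    """Return genes flagged by at least min_support analyses.

    analyses: {analysis_name: iterable of gene identifiers}
    """
    support = Counter(
        gene for genes in analyses.values() for gene in set(genes)
    )
    return {gene for gene, n in support.items() if n >= min_support}


# Hypothetical candidate lists from the three analysis types.
analyses = {
    "exon": {"geneA", "geneB", "geneC"},
    "window": {"geneB", "geneC", "geneD"},
    "junction": {"geneC"},
}
conservative = genes_supported_by(analyses, min_support=2)
all_three = genes_supported_by(analyses, min_support=3)
```

Dividing the size of `conservative` by the number of genes with sufficient power would give a proportion analogous to the 6.5% figure, and `all_three` corresponds to the strongest candidates.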
Our findings in mice are quite similar to the only published genome-wide analysis of alternative splicing differences between closely related mammals: a comparison of human and chimpanzee transcriptomes using exon microarrays, which revealed that 6–8% of surveyed exons show pronounced splicing differences, involving ∼4% of profiled genes (Calarco et al. 2007). The similarity between the chimp-human proportion and the 6.5% estimate from our combined analyses is a further indication that this estimate is conservative. Given equal rates of splicing divergence, we would expect to find more changes in mice, because of the larger number of generations separating the taxa (i.e. 1.5 million generations separating subspecies of house mice vs. ∼300 000 generations separating humans and chimpanzees).
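As a rough worked illustration of this expectation, the generation counts quoted above can be compared directly. The symbols below are introduced only for this example, and the comparison assumes, as a simplification not stated in the text, that splicing changes accumulate in proportion to the number of generations.

```latex
\[
\frac{g_{\text{mouse}}}{g_{\text{human--chimp}}}
  = \frac{1.5 \times 10^{6}}{3 \times 10^{5}}
  \approx 5
\]
```

Under that assumption, roughly five times as many changes would be expected among house mouse subspecies as between humans and chimpanzees, so similar observed proportions are consistent with the mouse estimate being conservative.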
Our finding that differences in transcript structure are common among M. musculus subspecies adds to a growing body of evidence showing alternative splicing contributes substantially to transcriptome diversity in mammals at all scales – within individuals, within species, and among species at various levels of divergence. Detailed comparisons of divergence in nucleotide sequence (e.g. Elmer et al. 2010), gene expression level (e.g. Goetz et al. 2010; Wolff et al. 2010) and transcript structure in closely related species will shed light on the relative importance and interaction of these three modes of transcriptome divergence. For example, do bouts of adaptive evolution tend to occur through change in one or a combination of these modes? Does the predominant mode of adaptive evolution differ across tissues or functional classes?
Alternative splicing may also have important implications for research aimed at identifying the genetic basis of reproductive isolation. For example, reduced hybrid male fertility is a major reproductive barrier maintaining subspecies distinctness in M. musculus (Britton-Davidian et al. 2005; Good et al. 2008). Sterile males often have abnormal testis morphology and defects during spermatogenesis; thus, divergence in the testis transcriptome likely contributes to reproductive isolation among these taxa. Rapid, adaptive evolution of testis proteins has been documented at the sequence level in the Mus lineage (Good & Nachman 2005; Turner et al. 2008) and hundreds of genes are differentially expressed among M. musculus subspecies (Rottscheidt & Harr 2007; Voolstra et al. 2007). In the present study, we show that splice form variation among subspecies is common for testis-expressed genes. Taken together, these studies suggest the testis transcriptome is diverging rapidly among subspecies through all three modes, and thus they should all be considered when searching for the genetic factors contributing to hybrid male sterility and speciation.
We demonstrate that next generation sequencing can be successfully employed to characterize genome-wide differences in alternative splicing among closely related species. We find that evolutionary changes in alternative splicing are prevalent in the early stages of species divergence in house mice and make a substantial contribution to testis transcriptome variation. Future analyses at greater sequencing depth and in additional taxa will yield important insights into the evolution of gene regulation and the relative contribution of changes in nucleotide sequence, transcript abundance and transcript structure to proteome variation and adaptive evolution.
We would like to especially thank Cole Trapnell for his prompt and competent support on using his software TopHat. T. Bayer wrote Perl scripts to query the ENSEMBL annotation. T. Price gave helpful statistical advice and comments on the manuscript. We thank A. Yu, M. Teschke and C. Pfeifle for wild house mice and D. Tautz for logistical support during the project. This work was supported by a DFG grant to B. Harr (SFB-680).
Conflicts of interest
The authors have no conflict of interest to declare and note that the sponsors of the issue had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.