The advent of genome-scale technologies for studying genome sequence, organization and function has revealed a large number of non-protein coding RNA species and their functional roles in genome organization and regulation. Small nucleolar RNAs (snoRNAs) are a prominent class of non-protein coding functional RNAs. These RNA species are usually 60–150 nucleotides in length and are presently known to be involved in the post-transcriptional modification, especially methylation and pseudouridylation, of other RNA species such as rRNAs [Bachellerie et al., 2002]. The extent of these modifications differs according to phylogenetic kingdoms and species [Decatur and Fournier, 2002]. On the basis of distinct conserved sequence motifs and functional role, snoRNAs have been classified into two major groups, the C/D box snoRNAs (SNORDs) and the H/ACA box snoRNAs (SNORAs). C/D box snoRNAs are involved in 2′-O-ribose methylation of cellular RNA, and the H/ACA box snoRNAs provide the site for pseudouridylation of the target RNA [Balakin et al., 1996]. Both groups of snoRNAs interact with their respective targets in a sequence-specific manner, and the target specificity is brought about by sequences adjacent to the conserved motifs or boxes. Recent reports have also suggested regulatory functions for snoRNAs beyond guiding modifications of other cellular RNAs. These include ribosome synthesis [Kiss, 2002], regulation of alternative splicing of the trans-gene product [Kishore et al., 2010] and miRNA-like antisense silencing [Brameier et al., 2011]. Recent reports also suggest that snoRNAs could in fact give rise to other functional ncRNAs such as microRNAs [Ender et al., 2008].
snoRNAs associate with a number of proteins to form small nucleolar ribonucleoprotein particles (snoRNPs), which are the functional units involved in RNA modifications. snoRNAs guide the snoRNPs to their respective targets, and this targeting is thought to be specific and governed by sequence complementarity [Reichow et al., 2007]. Recent reports have revealed the association of snoRNAs with disease processes and suggested a role for mutations in these loci in several diseases such as non-small-cell lung cancer (NSCLC) [Liao et al., 2010], breast cancer [Dong et al., 2009; Askarian-Amiri et al., 2011] and Prader-Willi syndrome (PWS) [Sahoo et al., 2008; Ding et al., 2008].
Though genomic variations have been extensively analyzed in protein-coding genes, genomic variations in non-coding RNAs have rarely been systematically collected or analyzed. The availability of genome-scale variation information for a number of individuals from multiple populations, through large collaborative initiatives like the HapMap [The International HapMap Consortium, 2003] and the 1000 Genomes projects [Hayden, 2008] and personal genome projects around the world, now enables one to look at a genome-wide landscape of variations. There are only a few earlier studies on variations in non-coding RNAs [Iwai et al., 2005]. It is thus imperative to collect genomic variations in non-protein coding loci and create novel tools and methods to systematically analyze their potential functional impact. We have earlier performed a genome-wide study evaluating the role of genomic variations in miRNAs, which are a major and well-studied class of non-coding functional RNAs [Bhartiya et al., 2011]. Hiard et al. [2010] discussed variants in miRNA precursors that were predicted to perturb miRNA-mediated regulation, with the aim of providing a bioinformatics tool to predict these polymorphisms. Recently, Johnson et al. [2011] determined how frequently variants could alter RNA structure and studied SNPs genome-wide in mRNAs and other functional RNAs such as IRES, SECIS, miRNAs, and snoRNAs. Similarly, Halvorsen et al. [2010] analyzed 514 disease-associated SNPs in untranslated regions in 350 RNAs and discussed the structural consequence of a specific SNP in the regulatory RNA.
Here we systematically collect and curate genomic variations in snoRNA loci to create a comprehensive resource for variations conforming to the Human Genome Variation Society (HGVS) (http://www.hgvs.org/mutnomen/) standards for nomenclature of genomic variations. We also use computational tools to analyze the potential impact of variations on the secondary structure of the snoRNAs, which is critical for their functionality. We further extend the analysis to integrate information from the HapMap and the 1000 Genomes projects to evaluate selection of these variations in different populations. To the best of our knowledge, this is the most comprehensive collection of genomic variations in snoRNA loci and one of the first systematic analyses of the impact of genomic variations in snoRNAs. This resource is publicly available at http://genome.igib.res.in/snolovd and is open to contributions from the community.
MATERIAL AND METHODS
Sequence and Variation Datasets
All snoRNAs present in the January 2011 update of the HUGO Gene Nomenclature Committee (HGNC) resource were collected. The corresponding RefSeq ID of each snoRNA was used to look for any reported validated single nucleotide variations in the respective snoRNA in the database of Single Nucleotide Polymorphisms (dbSNP build 132 on the Feb. 2009 assembly of the human genome (GRCh37/hg19)). The variant annotations were systematically converted to nomenclature conforming to the HGVS standards. Variations were also curated from the literature and converted to the HGVS nomenclature before submission to the locus-specific variation database. The snoRNA sequences were downloaded from NCBI (National Center for Biotechnology Information) (http://www.ncbi.nlm.nih.gov/) using their respective IDs.
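The conversion of a genomic SNP annotation into an HGVS-style description can be illustrated with a minimal sketch. It is not the curation pipeline used here; it assumes an unspliced transcript with 1-based inclusive genomic coordinates, uses the HGVS `n.` prefix for a non-coding reference sequence, and the function name `hgvs_substitution` and the accession `NR_EXAMPLE.1` are hypothetical placeholders.

```python
# Complement table for reporting alleles on the transcript strand.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def hgvs_substitution(accession, tx_start, tx_end, strand, genomic_pos, ref, alt):
    """Describe a single nucleotide substitution relative to a non-coding transcript.

    Assumptions (illustrative only): the transcript is unspliced, and
    tx_start/tx_end/genomic_pos are 1-based inclusive genomic coordinates
    with ref/alt given on the forward genomic strand.
    """
    if not tx_start <= genomic_pos <= tx_end:
        raise ValueError("variant lies outside the transcript")
    if strand == "+":
        pos = genomic_pos - tx_start + 1
    elif strand == "-":
        # Count from the transcript 5' end and complement the alleles.
        pos = tx_end - genomic_pos + 1
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    else:
        raise ValueError("strand must be '+' or '-'")
    return f"{accession}:n.{pos}{ref}>{alt}"


# Hypothetical locus spanning genomic positions 1000-1099.
print(hgvs_substitution("NR_EXAMPLE.1", 1000, 1099, "+", 1044, "A", "G"))
print(hgvs_substitution("NR_EXAMPLE.1", 1000, 1099, "-", 1090, "A", "G"))
```

For a minus-strand locus the alleles are complemented so that the description reads in the orientation of the transcript rather than of the genome assembly.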
Locus Specific Variation Database
We used the open-source Leiden Open Variation Database (LOVD) package [Fokkema et al., 2005], which is a popular interface for systematic curation and organization of variation data. We selected LOVD primarily because of its ease of use, the additional functionalities that enable creation of custom data fields, the implementation of different modules for error-checking and analysis, and its popularity among the community. The implementation used build 20 of LOVD v.2.0. The web server was based on Apache and the database backend was implemented in MySQL. A custom search engine with its own interface was also implemented in CGI-Perl to ease the search for relevant data.
Effect of Single Nucleotide Variations on RNA Secondary Structure
We employed the SNPfold algorithm (http://ribosnitch.bio.unc.edu/snpfold/SNPfold.html) [Halvorsen et al., 2010] to evaluate the effect of single nucleotide variations on RNA secondary structure (Fig. 1). SNPfold calculates the partition function, a matrix representation of the base-pairing possibilities of each base of the RNA sequence under consideration. SNPfold has previously been employed successfully to analyze the consequences of single nucleotide variations in modulating the structure of RNAs and their potential phenotypic associations. To evaluate the structural changes, we compared the sequence derived from the NCBI database (Reference) against the sequence in which the corresponding changes were made according to the variation mapping (Variant). The structures, the correlation coefficients and the p-values were retrieved for further analysis. We used the default cutoffs described in the previous study that used the algorithm to evaluate the effect of single nucleotide variations on RNA structure. The snoRNA loci harboring variations with the potential to change the RNA structure were further evaluated for their secondary structures using Srna from the Sfold suite of tools. Srna statistically samples the snoRNA structure from the Boltzmann ensemble of RNA secondary structures [Ding et al., 2004]. Single nucleotide variations in snoRNAs were also analyzed for their occurrence in the conserved boxes/motifs of snoRNAs, which are critical for the binding of the substrate RNA with the complementary motif.
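The reference-versus-variant comparison can be sketched as follows. This is an illustration of the general idea, not the SNPfold implementation: it assumes the base-pair probability matrices have already been computed by an external folding program, and the helper names (`apply_snv`, `pairing_profile`, `pearson`) and the toy 4 x 4 matrices are hypothetical.

```python
from math import sqrt


def apply_snv(seq, pos, ref, alt):
    """Return the variant sequence with the base at 1-based position pos replaced."""
    if seq[pos - 1] != ref:
        raise ValueError("reference allele does not match the sequence")
    return seq[:pos - 1] + alt + seq[pos:]


def pairing_profile(bpp):
    """Sum each column of a symmetric base-pair probability matrix,
    giving the probability that each nucleotide is paired."""
    n = len(bpp)
    return [sum(bpp[i][j] for i in range(n)) for j in range(n)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(vx * vy)


# Toy matrices (made-up values): the variant weakens the outer base pair.
reference_bpp = [[0, 0, 0, 0.9], [0, 0, 0.5, 0], [0, 0.5, 0, 0], [0.9, 0, 0, 0]]
variant_bpp = [[0, 0, 0, 0.1], [0, 0, 0.5, 0], [0, 0.5, 0, 0], [0.1, 0, 0, 0]]

pcc = pearson(pairing_profile(reference_bpp), pairing_profile(variant_bpp))
print(apply_snv("GAUC", 2, "A", "G"), round(pcc, 3))
```

A coefficient near one means the pairing profile is essentially unchanged by the variant, whereas low or negative values flag a redistribution of the structural ensemble.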
Sequence alignments and two measures of conservation, phastCons and phyloP, computed for all species (vertebrate) and for two subsets (primate and placental mammal), were used to evaluate each variation in the context of evolutionary conservation. The data for the corresponding variations in the human genome were retrieved from the respective tracks of the UCSC genome browser.
Allele Frequencies of Single Nucleotide Variations in Different Populations
The allele frequencies for the variations were retrieved from the HapMap database (http://hapmap.ncbi.nlm.nih.gov/biomart/martview). Minor allele frequencies (MAFs) were analyzed. The integrated Haplotype Scores (iHS) of these SNPs were retrieved from Haplotter [Voight et al., 2006] to evaluate selection at the respective loci. Haplotter provides a web-based interface to scan the genome for regions of recent selection and uses the allele frequency information from the HapMap project.
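As a small illustration of the quantity analyzed here, the minor allele frequency can be derived from per-population allele counts. The sketch below is not tied to the HapMap download format; the function name `minor_allele_frequency` and the counts are made up for the example, and only the population labels (CEU, YRI) follow HapMap usage.

```python
def minor_allele_frequency(allele_counts):
    """Frequency of the second most common allele; 0.0 for a monomorphic site."""
    total = sum(allele_counts.values())
    if total == 0:
        raise ValueError("no alleles observed")
    counts = sorted(allele_counts.values(), reverse=True)
    if len(counts) < 2:
        return 0.0
    return counts[1] / total


# Hypothetical allele counts for one SNP in two populations.
counts_by_population = {
    "CEU": {"A": 150, "G": 50},
    "YRI": {"A": 120},
}

for population, counts in counts_by_population.items():
    print(population, minor_allele_frequency(counts))
```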
RESULTS AND DISCUSSION
Single Nucleotide Variations in snoRNA Loci
The mapping of single nucleotide variations in snoRNA loci was performed using custom scripts written in Perl. This was followed by manual checking and curation of the variations. Analysis of the mapping revealed that, out of 381 snoRNAs (272 C/D box and 109 H/ACA box) approved by HGNC, 151 snoRNAs (approximately 40% of the total) harbored 298 SNPs. These 151 snoRNAs included 105 C/D box snoRNAs with a total of 231 single nucleotide variations and 46 H/ACA box snoRNAs with 67 single nucleotide variations. The complete mapping is made available as Supporting Information (Supp. Table S1).
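The mapping step amounts to an interval lookup of SNP positions against snoRNA coordinates. The original scripts were written in Perl and are not reproduced here; the following Python sketch shows the same idea under the assumption of 1-based inclusive coordinates on a shared genome assembly, with hypothetical locus names, rsIDs and positions.

```python
from collections import defaultdict


def map_snps_to_loci(loci, snps):
    """Assign each SNP to every locus whose interval contains it.

    loci: iterable of (name, chrom, start, end), 1-based inclusive.
    snps: iterable of (rsid, chrom, pos).
    Returns {locus name: [rsid, ...]} for loci with at least one SNP.
    """
    by_chrom = defaultdict(list)
    for name, chrom, start, end in loci:
        by_chrom[chrom].append((start, end, name))
    hits = defaultdict(list)
    for rsid, chrom, pos in snps:
        for start, end, name in by_chrom.get(chrom, []):
            if start <= pos <= end:
                hits[name].append(rsid)
    return dict(hits)


# Made-up coordinates for illustration only.
loci = [("snoA", "chr1", 100, 180), ("snoB", "chr1", 500, 570), ("snoC", "chr2", 100, 160)]
snps = [("rs_demo1", "chr1", 150), ("rs_demo2", "chr1", 300), ("rs_demo3", "chr2", 100)]

mapping = map_snps_to_loci(loci, snps)
print(mapping)
print(f"{len(mapping)} of {len(loci)} loci harbor variations")
```

A linear scan per chromosome is adequate for a few hundred loci; a sorted or indexed interval structure would only matter at much larger scales.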
The snoRNA Locus Specific Variation Database
The entire dataset, including snoRNA information, the single nucleotide variations and relevant details and links, was ported on to the LOVD platform as snoLOVD for easy access and curation by the community. The database consists of a total of 124 loci at the time of publication and has an easy-to-use web interface. The dataset comprises 124 entries from both snoRNA-SNP mapping and literature surveys, including those involved in diseases like PWS [Ding et al., 2008]. The database connects to the scientific community using external links and is manually checked and curated for annotation errors. The database is designed to serve as the starting point for variation studies in snoRNA loci and enables download of the entire information as tab-delimited files. The resource is also open to data and suggestions from the community and is available at http://genome.igib.res.in/snolovd. To ensure ease of access, we have also implemented a custom search interface in CGI-Perl.
Analysis of Effect of Single Nucleotide Variations on RNA Secondary Structure
It is now well known that point mutations can change the secondary structure of an RNA, with potential implications for its function. Earlier studies on RNA secondary structure used minimum free energy (MFE) as the criterion to assess the thermodynamic stability of the RNA, but recent reports have used a more rigorous approach, using the partition function to evaluate the effect of single nucleotide variations on RNA structure. The partition function approach uses the equilibrium ensemble of RNA secondary structures instead of considering a single refined or MFE structure [Waldispuhl and Clote, 2007; Ding, 2006]. While a large number of single nucleotide variations do affect the local RNA structure, there are a few highly influential (or dominant) variations which can potentially have a large effect on the secondary structure of RNA [Halvorsen et al., 2010; Johnson et al., 2011]. We used the SNPfold algorithm to evaluate the effect of single nucleotide variations on RNA secondary structure. SNPfold examines the effect of a mutation in an RNA sequence on its ensemble structure. It uses RNA partition function calculations to generate a partition function matrix for both the reference sequence and the variant sequences of snoRNAs and calculates the difference between the two. Based on these values, the Pearson Correlation Coefficient (PCC) was calculated for all the conformations of the given RNA. A PCC value close to one indicates a negligible effect of the variation on the RNA structure. The snoRNAs and their variants were evaluated based on the PCC values and the p-values. A p-value cutoff of 0.05 was employed, and six of the 151 snoRNAs had variations at or below the cutoff. These six snoRNAs included five C/D box snoRNAs and one H/ACA box snoRNA. SNPs rs16837624 and rs12910266, discussed by Johnson et al. [2011], were also present in our analysis, but neither showed a significant effect on snoRNA structure.
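For reference, the correlation described above can be written out explicitly. The notation is ours and is meant only as a sketch of the standard Pearson formula applied to per-nucleotide pairing probabilities: $P^{\mathrm{ref}}$ and $P^{\mathrm{var}}$ denote the base-pair probability matrices of the reference and variant sequences of length $n$, and $p_j$ is the column sum for nucleotide $j$.

```latex
p_j^{\mathrm{ref}} = \sum_{i=1}^{n} P_{ij}^{\mathrm{ref}}, \qquad
p_j^{\mathrm{var}} = \sum_{i=1}^{n} P_{ij}^{\mathrm{var}}

\mathrm{PCC} =
\frac{\sum_{j=1}^{n} \left(p_j^{\mathrm{ref}} - \bar{p}^{\mathrm{ref}}\right)
                     \left(p_j^{\mathrm{var}} - \bar{p}^{\mathrm{var}}\right)}
     {\sqrt{\sum_{j=1}^{n} \left(p_j^{\mathrm{ref}} - \bar{p}^{\mathrm{ref}}\right)^{2}}\,
      \sqrt{\sum_{j=1}^{n} \left(p_j^{\mathrm{var}} - \bar{p}^{\mathrm{var}}\right)^{2}}}
```

Here $\bar{p}^{\mathrm{ref}}$ and $\bar{p}^{\mathrm{var}}$ are the means of the respective profiles over all $n$ positions.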
The complete list of snoRNAs, the variations, and the PCC and p-values are tabulated in Table 1.
Table 1. Complete listing of snoRNAs and single nucleotide variations having SNPfold p-values equal to or less than 0.05
The six snoRNAs with variations causing significant changes in structure were further analyzed for other correlates. Analysis of genomic conservation revealed that none of the variations fell in highly conserved regions. The RNA secondary structures of the six reference snoRNAs were further compared with those of their variants using the Srna module from the Sfold web server [Ding et al., 2004]. The structural changes were clearly distinguishable between the two sequences (reference versus variant), strongly suggesting a role for the variants in affecting the RNA secondary structure in the respective cases (Fig. 2). The complete data of the comparisons are available as Supporting Information (Supp. Fig. S1). The structures were also compared with the canonical snoRNA structures [Marz et al., 2011]. SNORD115-15 and SNORD49B were found to have exceptional structures quite different from their canonical ones. The single nucleotide variations were further checked for their occurrence in the conserved motifs of the snoRNAs. The analysis revealed that two snoRNAs (mgU6-53 and 14q(II-7)) had single nucleotide variations (rs77545594 and rs72700530) falling in the C/D box, while one snoRNA (ACA37) had a variation (rs73483657) in the H/ACA box. These motifs guide the assembly of the snoRNAs with different types of proteins into the snoRNP complex, and SNPs in these motifs could potentially alter their binding with the proteins and therefore the guide mechanisms.
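A check of whether a variant position falls within a conserved box can be sketched with simple pattern matching. This is an illustration rather than the procedure used in this study: it assumes the commonly cited consensus motifs (C box RUGAUGA, D box CUGA, H box ANANNA, ACA box ACA), ignores the positional constraints and degeneracy of real snoRNA boxes, and the function name `boxes_hit` and the test sequence are hypothetical.

```python
import re

# Commonly cited consensus motifs, written as regular expressions over RNA.
BOX_PATTERNS = {
    "C": "[AG]UGAUGA",
    "D": "CUGA",
    "H": "A[ACGU]A[ACGU]{2}A",
    "ACA": "ACA",
}


def boxes_hit(seq, pos, family):
    """Return the boxes of the given family ('CD' or 'HACA') whose
    consensus match covers the 1-based position pos."""
    rna = seq.upper().replace("T", "U")
    names = ("C", "D") if family == "CD" else ("H", "ACA")
    hits = []
    for name in names:
        # A lookahead lets overlapping occurrences of a motif be found.
        for match in re.finditer(f"(?=({BOX_PATTERNS[name]}))", rna):
            start, end = match.start(1) + 1, match.end(1)
            if start <= pos <= end:
                hits.append(name)
                break
    return hits


# Made-up C/D-like sequence: C box at positions 3-9, D box at 16-19.
toy = "GGAUGAUGACCCCCCCUGAGG"
print(boxes_hit(toy, 5, "CD"), boxes_hit(toy, 17, "CD"), boxes_hit(toy, 12, "CD"))
```

Because short motifs such as ACA and CUGA occur frequently by chance, a hit from this kind of scan would still need to be confirmed against the annotated box positions of the snoRNA in question.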
Allele Frequencies and Potential Signatures of Selection
All 298 single nucleotide variations in snoRNAs were queried against the HapMap database to retrieve their allele frequencies in different world populations. We could retrieve allele frequency information for only 56 single nucleotide variations. The compiled data are depicted in Supp. Figure S2. We could not retrieve allele frequency information for any of the potentially functional single nucleotide variations from the HapMap dataset. Further, minor allele frequencies from dbSNP, based on 1000 Genomes phase 1 genotype data from 629 individuals worldwide (released in the 08-04-2010 dataset), were studied for the six snoRNAs. The data are compiled as Supp. Table S2 [Sabeti et al., 2002; Voight et al., 2006].
Adaptive evolution has been reported for brain-specific snoRNAs, with a few rare variants identified through resequencing [Ogorelkova et al., 2009]; in that study, rs28522423, rs8179188 and rs12910266 were predicted to have functional consequences. These SNPs were also present in our analysis, but they did not show significant alterations in the structure of the snoRNAs.
iHS scores from Haplotter were analyzed to evaluate whether the single nucleotide variations fall in regions of selection, but none of the variants were found to have significant iHS scores. We also searched for GWAS signals around the snoRNA loci, but none of the variants had previously been associated with any genetic trait or disease.
Conclusions and Future Prospects
We have systematically collected single nucleotide variations in snoRNA loci and evaluated the variations for their potential to alter the secondary structure of snoRNAs. The snoRNA datasets and nomenclature were adopted from HGNC, which maintains a conservative dataset for snoRNAs. We are aware that there are a few other datasets in which additional snoRNAs might exist. The database will be updated with upcoming datasets [Makarova et al., 2011] and will follow the most accepted and conservative nomenclature. We also integrated ancillary information from different genome-scale datasets to evaluate signals of selection. An open resource with a web-based interface for query and retrieval of relevant information was created. We hope this resource will serve as a central point for community involvement in the curation and prioritization of variations for further in-depth experiments. As more data on genomic variations associated with disease processes and physiological traits become available, we hope to include the association information as and when it is published.
The authors acknowledge the inputs of Dr. Arijit Mukhopadhyay and Dr. Mohammed Faruq in the analysis and preparation of the manuscript. DB acknowledges the Senior Research Fellowship from CSIR, India. The authors acknowledge the funding support from CSIR, India through Grant NWP0036 (Comparative genomics of non-coding RNAs).