In the development of RNA interference therapeutics, merely selecting short interfering RNA (siRNA) sequences that are complementary to the mRNA target does not guarantee target silencing. Current algorithms for selecting siRNAs rely on many parameters, one of which is asymmetry, often predicted through calculation of the relative thermodynamic stabilities of the two ends of the siRNA. However, we have previously shown that highly active siRNA sequences are likely to have particular nucleotides at each 5′-end, independently of their thermodynamic asymmetry. Here, we describe an algorithm for predicting highly active siRNA sequences based only on these two asymmetry parameters. The algorithm uses end-sequence nucleotide preferences and predicted thermodynamic stabilities, each weighted on the basis of training data from the literature, to rank the probability that an siRNA sequence will have high or low activity. The algorithm successfully predicts weakly and highly active sequences for enhanced green fluorescent protein and protein kinase R. Use of these two parameters in combination improves the prediction of siRNA activity over current approaches for predicting asymmetry. Going forward, we anticipate that this approach to siRNA asymmetry prediction will be incorporated into the next generation of siRNA selection algorithms.
Therapeutic applications of RNA interference (RNAi) make use of a conserved pathway for gene expression regulation that possesses the potential for exquisite sequence specificity through the complementarity of short interfering RNAs (siRNAs) for their targets [1-3]. Although the technology has yet to demonstrate its full potential in clinical applications [4, 5], there remains major interest in developing siRNA-based therapeutics . Because RNAi represents a therapeutic approach that can be applied to nearly any disease [7, 8], improvements in the design and development of siRNA therapeutics have the potential to have a significant impact on clinical practice.
A number of intermolecular interactions are critical to the activity of siRNAs, including those with the delivery vehicle [9-11], the target mRNA [12-15], and the pathway proteins [16-19]. Whereas a single RNA guide strand and argonaute 2 are the minimal components required for active silencing in vitro , the proteins Dicer and TAR RNA-binding protein are important for RNA-induced silencing complex (RISC) loading complex/RISC activity in vivo [21-23]. Other proteins, such as the protein activator of dsRNA-dependent protein kinase R (PKR) [24, 25] and component 3 promoter of RISC , may also have important but as yet undefined functional roles in the RNAi process. One essential process executed by the pathway proteins is the identification and loading of the siRNA guide strand into RISC loading complex/RISC and the concomitant destruction of the passenger strand [2, 27, 28]. The likelihood of one siRNA strand becoming the guide strand relative to the other strand is termed asymmetry [27, 29].
There are currently multiple proteins that are thought to participate in sensing the asymmetry of siRNA duplexes [18, 30-32]. When the existence of siRNA asymmetry was first established, it was proposed that the relative hybridization stabilities of the two ends of the siRNA sequence constituted the principal means by which asymmetry was sensed by the pathway proteins . Since that time, nearly all algorithms for selecting highly active siRNAs have used a thermodynamic calculation for asymmetry, among other parameters [29, 33-37]. More recently, evidence has begun to accumulate that the terminal nucleotides on each 5′-end of the siRNA may be valuable for predicting the activity of an siRNA [18, 30, 38], in particular when classified according to the 16 possible combinations of nucleotides. When terminal nucleotide classification is combined with relative hybridization stability, the accuracy of predicting siRNA activity improves markedly .
In this study, we wanted to predict siRNA activities on the basis of only two asymmetry characteristics, terminal nucleotide classification and relative thermodynamic stability (Fig. 1), and establish experimentally their relative importance in determining the activity of an siRNA. Using a logistic regression model, we successfully predicted active and inactive sequences for the exogenous protein enhanced green fluorescent protein (EGFP) and the endogenous protein PKR. In addition, the combination of both end-sequence and thermodynamic stability features provided improved correlation with siRNA activity as compared with either feature individually. These results demonstrate that asymmetry may be determined by more factors than just relative stability, and algorithms for prediction of siRNA activity should also account for terminal nucleotide sequence classification in asymmetry calculations.
Ranking and selection of EGFP-targeting siRNAs
Our ranking algorithm was initially tested on siRNAs to target the EGFP mRNA. From the cDNA sequence (Doc. S1), there were 824 possible siRNA sequences, which were ranked according to the difference between the algorithm's predicted likelihood of high and low activity. For comparison, commercial algorithms (Dharmacon and Ambion) were also used. These selection algorithms were chosen because their predictions are based solely on the characteristics of the siRNA and not on other factors used in some selection algorithms, such as target mRNA structure, which would make it difficult to directly compare the accuracy of our asymmetry-based predictions with predictions from more detailed selection approaches. The commercial rankings only included sequences predicted to have high activity, as opposed to the entire range of possible siRNA sequences. Although this is adequate for those needing effective siRNA sequences, it does not provide sufficient data to enable comparison of the characteristics of high-activity and low-activity siRNAs. The Dharmacon algorithm ranked the recommended siRNAs, whereas for the Ambion algorithm there were no distinctions among the top 35 candidate sequences. Interestingly, there was no overlap between the lists of recommended sequences provided by the commercial algorithms. Aggregating the commercial recommendations with our predictions, we chose 11 sequences to test experimentally that would allow us to preliminarily compare the relative utility of the three prediction approaches (Table 1).
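The enumeration of candidate sites described above can be sketched as a sliding window over the cDNA. This is a minimal illustration, not the authors' implementation: the 19-nt window (the duplex region of a 21-nt siRNA), the function names and the toy input are all assumptions, and the source does not state how its candidate counts were derived.

```python
# Sketch: enumerate candidate siRNA target sites along a cDNA sequence.
# WINDOW = 19 is an assumption (the base-paired core of a 21-nt siRNA).
WINDOW = 19

COMPLEMENT = {"A": "U", "C": "G", "G": "C", "U": "A"}


def to_rna(cdna: str) -> str:
    """Upper-case the cDNA and write it in the RNA alphabet (T -> U)."""
    return cdna.upper().replace("T", "U")


def reverse_complement(rna: str) -> str:
    """Return the reverse complement of an RNA string, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def candidate_sites(cdna: str, window: int = WINDOW):
    """Return (1-based 5' target position, sense core, antisense core) tuples."""
    mrna = to_rna(cdna)
    sites = []
    for start in range(len(mrna) - window + 1):
        sense = mrna[start:start + window]
        antisense = reverse_complement(sense)  # candidate guide strand
        sites.append((start + 1, sense, antisense))
    return sites


# Hypothetical 30-nt input, used only to exercise the functions.
toy_cdna = "ATGGTGAGCAAGGGCGAGGAGCTGTTCACC"
sites = candidate_sites(toy_cdna)
print(len(sites), sites[0])
```

A sequence of length N yields N − 18 windows under this assumption; each window can then be scored and ranked.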
Table 1. EGFP-targeting siRNA sequences selected for this study, sorted by algorithm rank.
Column headings: 5′ Target position; Thermodynamic ΔΔG (kcal·mol−1).
Row annotations, in table order: Highest algorithm ranking; Highest Dharmacon rank (91); On Ambion list; High end, low ΔΔG; On Ambion list; On Ambion list; On Ambion list; Low rank, high Dharmacon (73); Low end, high ΔΔG; On Ambion list; Lowest algorithm rank.
Transfection experiments were performed with human lung carcinoma (H1299)-EGFP cells at various siRNA concentrations (Fig. 2). Surprisingly, 81% of the sequences had some silencing effect as compared with control treatments, with 73% of sequences showing a > 75% reduction in protein levels at siRNA concentrations of 50 nm and 100 nm. One sequence (EGFP 783) showed intermediate silencing (the difference from other sequences is indicated by double asterisks), suggesting a gradient, rather than a step change, in silencing ability between active and inactive sequences. In general, sequences predicted by our algorithm to have higher activity showed increased inhibition of EGFP fluorescence. The rank order of activities was maintained at lower (5 nm) and saturating (50 nm and 100 nm) siRNA concentrations. The two sequences chosen on the basis of their opposing rankings between the two features in our approach, EGFP 757 (favorable terminal sequence; unfavorable ΔΔG) and EGFP 783 (unfavorable terminal sequence; favorable ΔΔG), ultimately showed activities that correlated with their terminal nucleotide classification rather than their thermodynamic stability.
To further investigate the hypothesis that terminal sequence classification was more important than thermodynamic stability in predicting siRNA activity, silencing efficiencies were compared against algorithm rank, terminal nucleotide rank and ΔΔG values individually (Fig. 3). The correlation between activity and terminal nucleotide rank was better than for thermodynamic stability alone, with the correlation with full algorithm rank being better than either alone. This agrees with our prior work showing that terminal nucleotide classification is generally a more informative predictor of siRNA activity, but that inclusion of the thermodynamic calculation provides some additional complementary information .
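The comparison above rests on a correlation coefficient between a ranking and the measured activity. The sketch below computes a Pearson coefficient in plain Python. The choice of Pearson's r is an assumption (the text does not name the coefficient used), and the ranks and expression values are hypothetical placeholders rather than data from this study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical values: rank of each siRNA and the fraction of protein remaining.
ranks = [1, 2, 3, 4, 5]
remaining = [0.10, 0.20, 0.15, 0.60, 0.90]
print(round(pearson_r(ranks, remaining), 3))
```

The same function can be applied to the full algorithm rank, the terminal nucleotide rank or the ΔΔG values in turn, so that the three predictors are compared on an equal footing.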
Ranking and selection of PKR-targeting siRNAs
Although our algorithm successfully predicted active and inactive sequences for EGFP, we wanted to confirm similar results for an endogenous protein, the signaling pathway mediator PKR. In addition, through systematic selection of siRNAs of high, medium and low nucleotide ranking and siRNAs having high, medium and low relative thermodynamic stabilities (Table 2), we aimed to further explore the relative importance of each of these characteristics in predicting silencing activity. We selected PKR as a model endogenous protein on the basis of our prior work and expertise in silencing this protein . Although PKR is a double-stranded RNA-responsive protein and is known to be functionally connected to proteins in the RNAi pathway , the lack of any cytotoxicity across all of our experiments suggested that it was not initiating any generalized immune response to the siRNAs that would confound our specific silencing results.
Table 2. PKR-targeting siRNA sequences selected for this study, sorted by end-sequence ranking, followed by relative thermodynamic stability.
Column headings: 5′ Target position; Thermodynamic ΔΔG (kcal·mol−1).
Row annotations, in table order: High end, high ΔΔG; High end, medium ΔΔG; High end, low ΔΔG; Medium end, high ΔΔG; Medium end, medium ΔΔG; Medium end, low ΔΔG; Low end, high ΔΔG; Low end, medium ΔΔG; Low end, low ΔΔG.
Transfection experiments were performed with hepatocellular carcinoma (HepG2) cells at 100 nm siRNA (Fig. 4). In this case, 55% of sequences showed a > 50% reduction in PKR protein levels as compared with control cells. Again, sequences predicted to have higher activity showed a greater reduction in PKR protein levels. When sorted by end-nucleotide classification, sequences in the UG class showed the best silencing activity, on average, regardless of their thermodynamic stability. Conversely, sequences in the low-ranking terminal nucleotide class, CU, did not show significant silencing, even when they had highly positive ΔΔG values. In the intermediate category of end sequence (AA), silencing activity correlated strongly with the calculated thermodynamic stability, with a favorable value resulting in significant silencing and an unfavorable value not. Taken together with our results from the EGFP experiments, these results support the argument that terminal sequence classification is a stronger predictor of siRNA activity than relative hybridization stability.
Our algorithm achieved a better correlation between rank and siRNA activity than that achieved by either terminal nucleotide classification or thermodynamic stability independently (Fig. 5). Although the PKR terminal nucleotide correlation is lower than that obtained from the EGFP data, the thermodynamic stability and overall algorithm rank correlation coefficients remain similar, even with a larger range of possible siRNA sequences (2352 for PKR versus 824 for EGFP), indicating that the algorithm performs consistently across targets of different sizes. It is noteworthy that, for both targets, the top-ranked sequence, i.e. the one that would be chosen when designing an siRNA against a new target, was highly active in silencing the target, further supporting our hypothesis that using only two parameters was sufficient for identifying active siRNAs against novel targets.
Asymmetry is well established as a useful criterion for selecting active siRNA sequences, and multistep workflow protocols for selecting effective siRNAs have been developed [41, 42]. However, the selection algorithms themselves are based in part on asymmetry, with relative thermodynamic stability as the sole factor used to determine it. Other reported algorithms lack a consensus on the best way to calculate thermodynamic asymmetry for siRNA activity prediction [36, 43-47]. When the commercially available algorithms were utilized for comparison in this study, there was no overlap between the sequences predicted to be highly active by Dharmacon and those predicted by Ambion, further illustrating the need for a consensus approach to selecting highly active siRNAs, including agreement on which parameters are most useful for such predictions.
The results described here further illustrate that accounting only for thermodynamic asymmetry ignores a more important feature in asymmetry, terminal nucleotide classification. Although others have identified terminal nucleotides as factors relevant to siRNA design (e.g. ), our approach of pairing the antisense and sense termini and weighting each pairing individually provides a unique and improved classification for sequences and their activities. Our selection technique achieves correlation coefficient values of > 0.8, whereas previously reported algorithms typically achieve correlation coefficients of 0.5–0.7 between algorithm predictions and experimental results [41, 48]. As our approach is focused on the contribution of asymmetry in the siRNA to its ultimate activity, it was important to compare our approach with other ways of determining asymmetry. In calculating thermodynamic asymmetry, it is common to use one, three or four nearest neighbors at each end of the siRNA as the basis for calculation [29, 36, 48, 49]. In our prior work , we showed that, in concert with terminal nucleotide classification, calculating thermodynamic asymmetry with three nearest neighbor parameters provided the most information, whereas one nearest neighbor provided the most information in the absence of terminal nucleotide classification. On the basis of this context, we compared the correlation of our data with thermodynamic calculations performed with one, three and four nearest neighbor parameters (Fig. 6). In all cases, the correlation of our experimental data was best with rankings including terminal nucleotide classification (Table 3). This strongly supports our contention that all siRNA selection algorithms would be improved by the inclusion of our asymmetry approach in place of their current asymmetry calculation.
Table 3. The correlation coefficients between EGFP silencing at 100 nm and the four different ranking methods in Fig. 6.
Column heading: Correlation coefficient (r).
Row labels, in table order: ΔΔG (1 nearest neighbor); ΔΔG (3 nearest neighbors); ΔΔG (4 nearest neighbors).
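The ΔΔG variants compared above differ only in how many terminal nearest-neighbor pairs are summed at each end of the duplex. The sketch below shows one way such a calculation might be organized. Several points are assumptions: the stacking energies are illustrative placeholders in the style of published RNA nearest-neighbor tables and should be replaced with the parameter set specified in the cited references; overhangs and terminal penalties are ignored; and the sign convention (antisense 5′-end minus sense 5′-end, so that positive values indicate a less stable antisense 5′-end) is inferred from the text rather than stated in it.

```python
# Placeholder stacking free energies (kcal/mol), keyed by the 5'->3'
# dinucleotide on one strand; the Watson-Crick partner is implied.
# ASSUMPTION: illustrative values only, to be replaced by the cited parameters.
NN_STACK = {
    "AA": -0.93, "UU": -0.93, "AU": -1.10, "UA": -1.33,
    "CU": -2.08, "AG": -2.08, "CA": -2.11, "UG": -2.11,
    "GU": -2.24, "AC": -2.24, "GA": -2.35, "UC": -2.35,
    "CG": -2.36, "GG": -3.26, "CC": -3.26, "GC": -3.42,
}

COMPLEMENT = {"A": "U", "C": "G", "G": "C", "U": "A"}


def reverse_complement(rna: str) -> str:
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def end_stability(strand: str, n_pairs: int) -> float:
    """Sum the first n_pairs nearest-neighbor stacks from the strand's 5'-end."""
    return sum(NN_STACK[strand[i:i + 2]] for i in range(n_pairs))


def delta_delta_g(sense_core: str, n_pairs: int = 3) -> float:
    """Assumed convention: dG(antisense 5'-end) - dG(sense 5'-end)."""
    antisense_core = reverse_complement(sense_core)
    return end_stability(antisense_core, n_pairs) - end_stability(sense_core, n_pairs)


# Hypothetical 19-nt sense core with a GC-rich 5'-end and an AU-rich 3'-end.
sense = "GGCCAGUACGAUCGAAUUA"
for n in (1, 3, 4):
    print(n, round(delta_delta_g(sense, n), 2))
```

Because swapping the two strands exchanges the two ends, the value for a sequence is the negative of the value for its reverse complement, which provides a quick sanity check on any implementation.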
Our algorithm, as structured, only ranks sequences according to the likelihood of them silencing the intended target. It was our intention in this study to determine whether the factors that we tested were useful in predicting sequences of high activity. Our ranking approach does not account for potential off-target effects of the sequences. For the long-term design of siRNA therapeutics, it is essential that off-target effects be taken into account as well. That said, it is our belief that beginning therapeutic design with the most highly active sequence, which can then be modified, if needed, to improve its specificity, is a better approach to obtaining a useful therapeutic than beginning with the most highly specific sequence, which may then need to be modified to improve its silencing activity against the intended target.
Although our approach is useful for predicting active (and inactive) siRNAs, we have not yet established the causal relationship between the terminal nucleotide classification and siRNA processing and activity. Indeed, it is well established that the properties of the siRNA alone provide only partial information regarding the likely activity of the siRNA [14, 15, 50, 51]. However, studies are increasingly reporting that more active siRNAs and microRNAs tend to contain specific nucleotides at the 5′-position of the guide strand [18, 33, 52], possibly a result of argonaute 2 binding specificity . We expect that, going forward, our ongoing work and that of others will more firmly tie the presence of particular 5′-end nucleotides on both the guide and passenger strands with important siRNA–protein interactions that occur in the pathway to ensure proper siRNA processing.
Algorithm design and parameters
Using information from both terminal nucleotide classification and thermodynamic stability, we developed a 17-parameter logistic regression model based on both the 16 possible end-sequence combinations and the relative thermodynamic stability (Table 4; additional details in Doc. S1). The relative stability is calculated from the difference between the hybridization free energy from the 5′-end of the antisense strand and the 5′-end of the sense strand, termed ΔΔG, based on the three terminal nearest neighbor pairs [38, 53] (Fig. 1). ΔΔG calculations were based on 21-nucleotide siRNAs with equivalent UU overhangs on the end of each strand. This calculation technique, when coupled with terminal sequence information, was shown to have the best predictive accuracy when tested on existing siRNA activity databases . The weighting factors for each of the 17 parameters were based on fitting the model to the same siRNA databases [43, 44]. From a cDNA sequence input, the algorithm predicts the probability that the given sequence will have high, medium or low activity. By use of the difference between the high and low probabilities, each siRNA sequence for a given target was ranked from the highest to the lowest difference. The cDNA sequences used for EGFP and PKR are included in Doc. S1, with siRNA target regions highlighted. For comparison with other asymmetry approaches (Fig. 6), asymmetry calculations were also performed with one and four nearest neighbor parameters. To ensure the most accurate comparisons across ranking approaches, sequences with equivalent values were all given the best possible ranking.
Table 4. Values of the coefficients used as weighting factors for predicting the probability of siRNAs having high (top third among all possible sequences) and low (bottom third among all possible sequences) activity.
Positive values indicate features that are positively correlated with high siRNA activity. By application of these weights in our algorithm, siRNAs were ranked by the magnitude of the difference between the probability of having high activity and the probability of having low activity. The ‘Intercept’ is applied to all sequences, and arises from the regression model (Doc. S1).
Lipofectamine 2000 (LF2K) was purchased from Invitrogen. All EGFP and PKR siRNA sequences were 21 nucleotides long (19 bp plus UU overhangs), and were purchased from Dharmacon. Opti-Mem (Gibco) was used for preparation of all transfection solutions. Monoclonal primary antibody against PKR (Y117) was purchased from Novus Biologicals. Monoclonal primary antibody against β-actin was purchased from Sigma. Secondary antibodies (anti-rabbit and anti-mouse) were purchased from ThermoScientific.
H1299 cells constitutively expressing a form of EGFP, modified to have a 2-h half-life, were generously provided by J. Kjems (University of Aarhus, Denmark). HepG2 cells were purchased from the American Type Culture Collection. Cell culture medium was prepared with DMEM High Glucose (Invitrogen) supplemented with 10% fetal bovine serum (Gibco) and 1% penicillin/streptomycin (Gibco). For the H1299 cells, 1% Geneticin (Gibco) was added to maintain EGFP expression. Cells were incubated at 37 °C in 5% CO2, at 100% relative humidity, and subcultured every 4–5 days by trypsinization.
EGFP silencing and fluorescence analysis
H1299-EGFP cells were seeded in 96-well, black-sided, clear-bottomed plates (Fisher Scientific) at a density of 20 000 cells per well in 0.1 mL of complete medium without antibiotics. After 24 h, 50-μL solutions of various siRNAs and LF2K were prepared in Opti-Mem, and allowed to mix for 30 min prior to their addition to cells at final concentrations of 5–100 nm siRNA and 2.3 μg·mL−1 LF2K. Cells were incubated in the transfection solutions at 37 °C in 5% CO2, at 100% humidity. After 24 h, cells were washed twice with Dulbecco's NaCl/Pi (Gibco), and EGFP fluorescence was quantified with a Gemini EM fluorescent plate reader (Molecular Devices) at 480-nm excitation and 525-nm emission. Fluorescence intensity was normalized to control wells that were treated with transfection reagent but no siRNA. Cytotoxicity was assessed by microscopy, and was not seen in any of the treatments (Fig. S1).
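The normalization step above amounts to dividing each treated-well reading by the mean of the control wells. A minimal sketch follows; the readings are hypothetical, and the percent-silencing helper is an added convenience rather than a quantity defined in the protocol.

```python
from statistics import mean


def normalized_fluorescence(sample_wells, control_wells):
    """Express each sample reading as a fraction of the mean control reading.

    Control wells received transfection reagent but no siRNA.
    """
    baseline = mean(control_wells)
    if baseline <= 0:
        raise ValueError("control fluorescence must be positive")
    return [reading / baseline for reading in sample_wells]


def percent_silencing(sample_wells, control_wells):
    """Hypothetical helper: mean reduction relative to control, in percent."""
    return 100.0 * (1.0 - mean(normalized_fluorescence(sample_wells, control_wells)))


# Hypothetical plate-reader values in arbitrary fluorescence units.
control = [1000.0, 1100.0, 900.0]
treated = [250.0, 200.0, 150.0]
print(normalized_fluorescence(treated, control))
print(round(percent_silencing(treated, control), 1))
```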
PKR silencing and western blotting
For siRNA transfections targeting human PKR in HepG2 cells, reverse transfection was performed. Briefly, 250-μL solutions of various siRNAs and LF2K were prepared in Opti-Mem, and allowed to mix for 30 min prior to their addition to standard six-well tissue culture plates. Freshly trypsinized HepG2 cells suspended in antibiotic-free medium were added to the six-well culture plates at a density of 1.5 × 10^6 cells per well to achieve final concentrations of 100 nm siRNA and 2.3 μg·mL−1 LF2K. The cells were then incubated at 37 °C in 5% CO2, at 100% humidity, for 48 h and collected. PKR levels were measured by western blot analysis (Fig. S2). The cells were washed twice with cold NaCl/Pi, and lysed in 200 μL per well of CelLytic M cell lysis buffer (Sigma) supplemented with protease inhibitor cocktail (Sigma). The cell lysate was clarified by centrifugation at 15 000 g for 10 min, and the supernatant was collected. Total protein levels were quantified with the Quick Start Bradford protein assay (Bio-Rad). Approximately 20 μg of total protein was resolved on SDS/PAGE gels (8% resolving, 5% stacking). Proteins were then transferred to nitrocellulose membranes, and probed with primary and secondary antibodies. Biotinylated protein ladders (Cell Signaling Technology) were loaded onto one well of each SDS/PAGE gel, and antibody against biotin was used to detect the protein ladders on the western blots. Antibody detection was performed with the SuperSignal West Femto Chemiluminescence substrate kit (ThermoScientific), and imaging was performed with the Molecular Imager ChemiDoc XRS System (Bio-Rad). Band intensity was first normalized to actin to control for protein isolation and loading, and the ratio was then normalized to the ratio for control cells that received transfection reagent but no siRNA. Cytotoxicity was assessed by microscopy and CellTiter-Blue assay (Promega) (Fig. S1), and was not seen in any of the treatments.
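The two-step band normalization described above can be written as a ratio of ratios. The sketch below is illustrative only; the function name and the band intensities are hypothetical.

```python
def relative_pkr_level(pkr_band, actin_band, control_pkr_band, control_actin_band):
    """Normalize PKR to actin, then to the same ratio in control cells.

    Control cells received transfection reagent but no siRNA.
    """
    if actin_band <= 0 or control_actin_band <= 0 or control_pkr_band <= 0:
        raise ValueError("band intensities used as denominators must be positive")
    sample_ratio = pkr_band / actin_band
    control_ratio = control_pkr_band / control_actin_band
    return sample_ratio / control_ratio


# Hypothetical densitometry values in arbitrary units.
level = relative_pkr_level(
    pkr_band=300.0, actin_band=1000.0,
    control_pkr_band=1200.0, control_actin_band=1000.0,
)
print(round(level, 2))
```

A value of 1 corresponds to no change relative to control, and values below 1 indicate reduced PKR protein.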
Multiple comparisons between protein levels across different siRNA treatment conditions were performed with one-way (PKR data) or two-way (EGFP data) ANOVA followed by Tukey's HSD post hoc analysis, with the P-value cut-off set at 0.05. Analyses were performed with either Microsoft Excel or Minitab.
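For orientation, the F statistic underlying a one-way ANOVA can be computed directly from the group data. The sketch below uses only the Python standard library and hypothetical values; it stops at the F statistic and its degrees of freedom, because the P-value and Tukey's HSD comparisons require the F and studentized range distributions, which a statistics package would normally supply.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of measurements."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and more observations than groups")
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Variation of observations around their own group mean.
    ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, group_means))
    if ss_within == 0:
        raise ValueError("F is undefined when there is no within-group variation")
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical normalized protein levels for three treatment groups.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_b, df_w = one_way_anova_f(groups)
print(round(f_stat, 2), df_b, df_w)
```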
Financial support for this work was provided in part by Michigan State University (MSU Foundation, Center for Systems Biology, and MSU Graduate School), the National Institutes of Health (GM079688, GM089866, RR024439, DK081768, and DK088251), and the National Science Foundation (CBET 0941055). We thank all members of the Cellular and Biomolecular Laboratory for their advice and support. We also thank J. Kjems (University of Aarhus, Denmark) for providing us with the EGFP cells.