Comparison of cleavage site prediction accuracies
We experimentally determined the N-terminal sequences of 270 mature secreted and cell-surface proteins. The N termini of mature proteins were recorded as signal peptide cleavage sites, unless evidence existed for any further post-translational cleavage of protein precursors. This data set (Supplementary Table 1) was used for signal cleavage site studies. The performance of signal peptide prediction algorithms is usually evaluated in two areas: the ability to distinguish signal peptides from nonsignal sequences and the ability to locate the signal cleavage sites. Usually, only the percentages of correctly predicted sites among positively predicted sequences are reported. However, we find it more helpful to also include the overall percentage based on all test sequences.
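The two ways of reporting cleavage site accuracy described above can be made concrete with a minimal sketch. The function name, the input layout, and the use of `None` for a sequence in which no signal peptide is predicted are illustrative assumptions, not part of any of the programs evaluated here.

```python
from typing import Dict, Optional, Sequence


def cleavage_site_metrics(
    verified: Sequence[int],
    predicted: Sequence[Optional[int]],
) -> Dict[str, float]:
    """Summarize predictions against verified cleavage sites.

    verified  -- verified cleavage position for each test sequence
    predicted -- predicted position, or None when the program does not
                 call a signal peptide (hypothetical convention)
    """
    if len(verified) != len(predicted):
        raise ValueError("verified and predicted must have the same length")
    total = len(verified)
    if total == 0:
        raise ValueError("at least one test sequence is required")

    # Sequences in which a signal peptide was predicted at all.
    positives = [(v, p) for v, p in zip(verified, predicted) if p is not None]
    # Predictions that hit the verified site exactly.
    correct = sum(1 for v, p in positives if v == p)

    return {
        # Fraction of signal-containing proteins that were detected.
        "sensitivity": len(positives) / total,
        # The commonly reported figure: correct sites among positives only.
        "accuracy_among_positives": correct / len(positives) if positives else 0.0,
        # The additional figure: correct sites among all test sequences.
        "overall_accuracy": correct / total,
    }
```

Because every sequence in a verified data set contains a signal peptide, the overall accuracy is the product of the sensitivity and the accuracy among positives, so it can never exceed either of them.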
Benchmarked against our confirmed signal peptide sequences, all six programs gave high sensitivities in detecting signal-containing proteins, with SignalP 2.0-HMM and SignalP 3.0-NN being the highest, both at 98.5%. However, these programs showed much greater variation in their accuracies in predicting the cleavage sites (Fig. 1A). The best program appeared to be SignalP 2.0-NN, which precisely predicted 78.1% of sites. In contrast, both SigCleave and SigPfam yielded markedly lower accuracies. Our analysis indicates that SignalP 2.0-NN is the best overall prediction program, consistent with a previous observation (Menne et al. 2000). It should be noted that SignalP 2.0-HMM is reported to be better at distinguishing cleavable signal peptides from non-cleavable signal anchors (Nielsen and Krogh 1998), which we did not test. We did not observe major performance differences between the two versions of SignalP-NN, but the new version of SignalP-HMM (3.0) appeared to be better at recognizing signal cleavage sites than SignalP 2.0-HMM.
Analysis of matched SWISS-PROT data set
One of the intriguing aspects of developing and evaluating prediction algorithms is that both training and testing data sets are typically collected from the SWISS-PROT database. Although SWISS-PROT is arguably the best annotated protein sequence database, any incorrectly annotated entries will likely propagate into various prediction tools and results. It is therefore useful to evaluate the reliability of signal peptide annotation in the SWISS-PROT database. Of all the human protein sequences with annotated signal sequences in the current release of SWISS-PROT (Release 42), only 33.6% are marked to contain “Signal” under the feature table, which implies that there are experimental data for the presence and location of the signal peptides. The remaining 66.4% of protein entries are marked to contain “Signal By Similarity,” “Signal Potential,” or “Signal Probable,” which implies that these annotations are derived computationally, either by a signal peptide prediction program or by sequence similarity.
Of the 270 total protein sequences, 169 are represented in SWISS-PROT Release 42. Three of the SWISS-PROT entries, KAC_HUMAN, CK15_HUMAN, and CRI1_HUMAN, do not have a signal peptide annotation even though they are either secreted or transmembrane proteins. Of the remaining 166 proteins marked as having a signal peptide, 70.5% of the annotated cleavage sites are consistent with our verified sites. However, different types of signal annotation gave different results. Of the 113 protein entries marked to contain predicted signal peptides, only 63.7% of the predicted cleavage sites agree with our data. The remaining 53 are marked to contain “Signal” under the feature table and are therefore expected to be supported by experimental evidence. Forty-five of these annotated sites, or 85.0%, are identical to our verified sites. The eight discrepancies were investigated further based on the cited literature.
Surprisingly, five of the SWISS-PROT entries with discrepancies, AXO1_HUMAN, FCG1_HUMAN, HGFA_HUMAN, T10C_HUMAN, and TRLT_HUMAN, contain signal annotations based on prediction rather than experimental data (Allen and Seed 1989; Miyazawa et al. 1993; Tsiotra et al. 1993; Pan et al. 1997; Sica et al. 2001). In the case of T10C_HUMAN, both predicted (Pan et al. 1997) and experimental (Sheridan et al. 1997) data are available, but SWISS-PROT contains the predicted data. The experimental data are consistent with our results. For two further proteins with discrepancies, INA5_HUMAN and INA6_HUMAN, it is not clear to us whether their signal annotations were based on prediction or experimental data. Therefore, although references are provided that support a “Signal” annotation, the papers themselves may not contain experimental data to support the claim. The remaining discrepancy associated with SWISS-PROT, PTHY_HUMAN, is caused by postcleavage modification of preProPTH (Vasicek et al. 1983). In this case, the site we observed resulted from a cleavage that occurs after removal of the signal peptide. We estimate that such mistakes due to postcleavage modification occur very rarely and will not influence the overall quality of our data set. Regardless, the cleavage site for PTHY_HUMAN has been corrected.
To understand how previously reported prediction accuracies were achieved, we estimated the perceived accuracies for site prediction when benchmarked against SWISS-PROT annotation, as the SWISS-PROT data are usually used for validation. As shown in Figure 1B, when compared with the SWISS-PROT entries that have “Signal” annotations, the perceived prediction accuracies for SignalP 2.0-NN and SignalP 2.0-HMM are similar to those observed by us (Fig. 1A). A much higher perceived accuracy (73.6%) was observed with the SigPfam program, which was originally trained with similar sets of SWISS-PROT human sequences (Zhang and Wood 2003). This discrepancy suggests that HMM models could be overtrained or that the SWISS-PROT data might not always be the best test data. As expected, when analyzing the SWISS-PROT entries with computationally predicted signal peptides, the perceived accuracies are mostly lower. Interestingly, none of the prediction programs tested here matches the SWISS-PROT annotation extremely well, indicating that the SWISS-PROT annotations were historically based on multiple prediction algorithms before the database recently settled on the SignalP program.
Excluding the 45 proteins that are already correctly annotated in SWISS-PROT regarding the signal cleavage sites, the availability of our verified data for 225 proteins would represent a significant increase (32%) in the number of human protein sequences with annotations of verified cleavage sites in the current SWISS-PROT database.
Improving signal peptide prediction using verified cleavage sites
The confirmed N-terminal sequences of mature proteins provide a reliable data source for studying preferential amino acid usage after the signal peptide cleavage sites. We first determined the expected frequencies of amino acid usage by sampling the entire mature proteins, and then compared the usage at each of the positions after the cleavage sites with the expected frequencies. The log ratios of the observed and expected frequencies are plotted to reveal any usage preferences or antipreferences (Fig. 2). It is apparent that whereas some of the residues such as glycine are not preferentially used in any of the positions, several residues show biased usage. Tyrosine, for example, is clearly disfavored in the first few positions after cleavage sites. Proline, on the other hand, is preferred in many of the sites with the exception of the +1 position, where it is almost never found. Glutamine was found at the +1 residue in 10.7% of the 270 proteins in the data set. Because N-terminal glutamine is cyclized to pyroglutamic acid by glutamine cyclase during protein synthesis (Kamp et al. 1998), the N-terminal pyroglutamic acid may serve to protect these secreted proteins from degradation by extracellular aminopeptidases.
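The observed-versus-expected comparison described above can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used to produce Fig. 2: the base-2 logarithm, the optional pseudocount, the restriction to the 20 standard residues, and all function names are choices made here for the example.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues (assumption)


def expected_frequencies(mature_seqs: Sequence[str]) -> Dict[str, float]:
    """Background usage estimated over the entire mature proteins."""
    counts = Counter(aa for seq in mature_seqs for aa in seq if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard residues found in the input sequences")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def positional_log_ratios(
    mature_seqs: Sequence[str],
    n_positions: int = 5,
    pseudocount: float = 0.5,  # assumption: smooths residues never observed
) -> List[Dict[str, float]]:
    """log2(observed / expected) per residue; index 0 is the +1 position."""
    expected = expected_frequencies(mature_seqs)
    table: List[Dict[str, float]] = []
    for pos in range(n_positions):
        residues = [
            seq[pos] for seq in mature_seqs
            if len(seq) > pos and seq[pos] in AMINO_ACIDS
        ]
        counts = Counter(residues)
        denom = len(residues) + pseudocount * len(AMINO_ACIDS)
        row: Dict[str, float] = {}
        if denom > 0:
            for aa in AMINO_ACIDS:
                if expected[aa] == 0:
                    continue  # residue absent from the background; ratio undefined
                observed = (counts[aa] + pseudocount) / denom
                # Positive values mean preference, negative values antipreference.
                row[aa] = (
                    math.log2(observed / expected[aa])
                    if observed > 0 else float("-inf")
                )
        table.append(row)
    return table
```

With a pseudocount of zero, a residue that never occurs at a position receives a log ratio of negative infinity, which is why a small pseudocount is the default in this sketch.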
The availability of our confirmed signal cleavage sites also provides new opportunities for refining existing prediction programs. By removing incorrectly aligned signal peptide sequences and adding new ones that were confirmed experimentally, we believe that the alignment-based models for signal peptides could be improved. As a test, we built a new Pfam-compatible HMM model for signal peptides based on these confirmed sequences. The sensitivity of the SigPfam program based on the new model increased from 91.5% to 93.5% in a leave-one-out validation test. (The specificity was not tested for lack of a negative data set for this study.) In addition, the overall cleavage site prediction accuracy increased from 58.1% to 73.7%. Despite this performance improvement, SignalP 2.0-NN remains the best signal prediction program. It is possible that the SignalP package can be improved further with the assistance of our data set. Nevertheless, the improved SigPfam would be a useful tool as it can be easily implemented, configured, and tuned. All components of SigPfam are freely available at http://share.gene.com/share/cleavagesite.
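The leave-one-out validation mentioned above follows a general pattern that can be sketched independently of the model. The `train` and `predict` callables below are hypothetical placeholders for model building and scoring; they do not correspond to the actual SigPfam or HMM tooling, whose interfaces are not described here.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# (precursor sequence, verified cleavage position); layout is an assumption.
Example = Tuple[str, int]


def leave_one_out(
    examples: Sequence[Example],
    train: Callable[[List[Example]], object],
    predict: Callable[[object, str], Optional[int]],
) -> Tuple[float, float]:
    """Return (sensitivity, overall cleavage site accuracy).

    For each sequence, a model is built from all other sequences and then
    applied to the held-out one, so no sequence is scored by a model that
    was trained on it.
    """
    data = list(examples)
    n = len(data)
    if n < 2:
        raise ValueError("leave-one-out needs at least two examples")

    detected = 0  # held-out sequences in which a signal peptide was called
    correct = 0   # held-out sequences whose site was predicted exactly
    for i, (sequence, verified_site) in enumerate(data):
        model = train(data[:i] + data[i + 1:])  # everything except example i
        guess = predict(model, sequence)        # None means no signal peptide
        if guess is not None:
            detected += 1
            if guess == verified_site:
                correct += 1
    return detected / n, correct / n
```

This requires one model build per sequence, which is affordable for a data set of a few hundred proteins and uses every verified site for both training and testing.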