Protein secretion is a key mechanism of the interaction between the cell and its environment. Exported proteins can be anchored in the cytoplasmic or outer membrane, retained in the periplasmic space, or released into the medium (or, in the case of parasites, into the host). The functions of exported proteins include, among others, hydrolysis of extracellular polymers (proteins, nucleic acids, polysaccharides), synthesis of the cell wall and of intercellular matrices such as biofilms, modification of the host cell, and sensing of environmental conditions. The relative proportion of secreted and cytoplasmic proteins is a useful parameter for genome-based reconstructions of the organism's behaviour and metabolism (see e.g. Galperin, 2005).
In bacteria, translocation of proteins across the cytoplasmic membrane into the periplasm or the extracellular space is typically mediated by the interaction of their N-terminal signal peptides with the protein secretion machinery. Immediately after translocation, signal peptides are proteolytically removed by leader peptidases that are located in the periplasm or attached to the outer leaflet of the cytoplasmic membrane (Paetzel et al., 2002). Thus, identification of signal peptides indicates whether a given protein functions within or outside the cell, and has become an important component of genome annotation.
In classical experiments, signal peptides were identified on a case-by-case basis by determining the N-terminal sequence of the mature protein using Edman degradation and then comparing it with the sequence of the respective gene. This method is laborious and time-consuming and therefore hardly appropriate for whole-genome analysis. A thorough study of Escherichia coli proteins by Link and colleagues (1997), which used 2D gel analysis followed by Edman degradation, mapped 12 signal peptides. These and other early identifications provided the data necessary to elucidate the common feature of signal peptides: a tripartite structure, consisting of a positively charged n-domain, a hydrophobic h-domain, and a polar c-domain, which contains the cleavage motif (von Heijne, 1985; 1990). These observations allowed development of effective computational methods for predicting signal peptides, such as SignalP and Phobius (Nielsen et al., 1997a,b; Käll et al., 2004), which have become a key component of any genome analysis. Over the years, these software tools have been improved in order to increase sensitivity and performance (Nielsen et al., 1999; Käll et al., 2007) and to allow better recognition of non-standard signal peptides and those with non-cleavable signal sequences (Bendtsen et al., 2004; 2005; Petersen et al., 2011). Several other software tools for prediction of signal peptides (Fariselli et al., 2003; Hiller et al., 2004; Shen and Chou, 2007; Frank and Sippl, 2008; Reynolds et al., 2008) and protein localization (Nakai and Horton, 1999; Yu et al., 2010) have been developed. However, prediction methods are still usually trained on a relatively small set of experimentally confirmed targeting sequences from a narrow taxonomic range of organisms and therefore their reliability in predicting signal peptides of as yet unknown type(s) remains suspect (Payne et al., 2012). 
There is an obvious need for new high-throughput experimental approaches for identifying signal peptides that would provide a critical mass of reliable data for further progress of the popular software tools.
During the last decade, combined fractional diagonal liquid chromatography in conjunction with mass spectrometry has been used to define N-terminal peptides in humans and in two halophilic archaea (Gevaert et al., 2003; Aivaliotis et al., 2007; Staes et al., 2011). Proteogenomics (using mass spectrometry to identify proteins predicted from genomic sequences) has emerged as a promising approach to genome annotation, particularly for high-throughput identification of protein N-termini, a task that is not fully solved by the existing gene-calling algorithms (see e.g. Frishman et al., 1998). In addition, a comparison of the experimentally determined N-termini with those predicted by sequence analysis tools allows identification of the signal peptides and prediction of the extracytoplasmic (periplasmic or extracellular) localization of the respective proteins (Gupta et al., 2007; Venter et al., 2011). The proteogenomics approach infers that a peptide has been cleaved from the protein N-terminus when spectra match peptides with non-tryptic N-termini (each such peptide represents a potential new N-terminus of the protein after signal peptide cleavage). However, not all sequences inferred in this way are true signal peptides, and some filtering is needed to remove artefacts and obtain the set of likely signal peptides. Proteogenomics studies of Shewanella oneidensis (Gupta et al., 2007), Yersinia pestis (Payne et al., 2010), a microbial biofilm community (Erickson et al., 2010), and the fungus Aspergillus niger (Braaksma et al., 2010) revealed hundreds of signal peptide cleavage events.
Given the success of the proteogenomics studies in discovering signal peptides, we sought to validate this experimental technique on classical model organisms with extensive functional annotation. Such validation aimed to (i) evaluate the reliability and specificity of the identification of N-terminal peptides through proteogenomics and (ii) compare it with the reliability of the widely used computational tools. In other words, we tried to evaluate the utility of the proteogenomic approach as a means of genome-scale identification of bacterial signal peptides, particularly for investigation of as yet unexplored organisms (Payne et al., 2012). In addition, we tried to address a more fundamental question: what fraction of the total E. coli proteome has signal peptides and is therefore destined for export outside the cytosol?
We present here an analysis of putative signal peptides for E. coli K-12 strain MG1655 (Blattner et al., 1997; Riley et al., 2006). According to the UniProt database (The UniProt Consortium, 2012), E. coli has the highest number of experimentally confirmed signal peptides among all bacteria. Moreover, almost 600 E. coli proteins have experimentally determined cellular localization (Lopez-Campistrous et al., 2005). We show here that a single proteogenomics experiment recovered more than one-third of all experimentally known signal peptides from E. coli. In addition, it experimentally confirmed a number of previously predicted signal peptides. Overall, after appropriate filtering, all putative signal peptides reported by proteogenomics appeared to correspond to actual signal peptide cleavages previously reported or predicted for E. coli (albeit with six of them having non-standard cleavage sites). Early analyses projected that at least 15–20% of proteins in Gram-negative bacteria should have signal peptides (Nielsen et al., 1997a). For E. coli, this fraction was estimated to be in the range of 17–32% (Hiller et al., 2004; Käll et al., 2004; Shen and Chou, 2007). Using proteogenomics data, we argue that this was an overestimation, and that the actual number of proteins with signal peptides in E. coli is likely to be substantially smaller, about 10%, which is consistent with the estimates from the latest version of SignalP.
Generation and analysis of proteogenomic data
The proteomic data were generated following a previously described protocol (Payne et al., 2010); see Supporting information for details. Briefly, proteins purified from a whole-cell lysate of E. coli K-12 strain MG1655 were digested into peptides with trypsin (see Lewis et al., 2010). These peptides were analysed by MS/MS and the resulting spectra were matched to peptide sequences from the E. coli reference protein set in the RefSeq database (Pruitt et al., 2012) using the InsPecT program (Tanner et al., 2005), followed by re-scoring with PepNovo (Frank, 2009) and MSGF (Kim et al., 2008). Protein identifications were remapped to UniProt. For each identified protein, the most N-terminal peptide was selected; the 488 proteins whose peptide was not produced by tryptic digestion (i.e. was not preceded in the protein sequence by a Lys or Arg residue) were analysed further (see Fig. 1 and Table S1). Peptides missing only the N-terminal methionine were discarded. For the remaining 129 proteins, the sequence between the start and the first observed peptide was designated the putative signal peptide. These 129 peptides (Table S2) were sorted and filtered for an appropriate length (18–36 amino acids) and checked for the presence of a hydrophobic patch of at least eight consecutive residues (Fig. 2; for filtering details, see Supporting information).
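The filtering steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used in this work: the set of residues counted as hydrophobic, the function names, and the toy sequence are all hypothetical, and the actual filtering details are given in the Supporting information.

```python
# Illustrative sketch only; not the authors' pipeline.
# ASSUMPTION: which residues count as hydrophobic is chosen here for the example.
HYDROPHOBIC = set("AVLIMFWC")


def is_tryptic_start(protein: str, start: int) -> bool:
    """True if the peptide beginning at 0-based index `start` follows Lys or Arg."""
    return start > 0 and protein[start - 1] in "KR"


def longest_hydrophobic_run(seq: str) -> int:
    """Length of the longest stretch of consecutive hydrophobic residues."""
    best = run = 0
    for aa in seq:
        run = run + 1 if aa in HYDROPHOBIC else 0
        best = max(best, run)
    return best


def putative_signal_peptide(protein, start, min_len=18, max_len=36, min_patch=8):
    """Return the candidate signal peptide, or None if any filter rejects it.

    `start` is the 0-based index of the most N-terminal observed peptide.
    """
    if start == 0:
        return None  # peptide begins at the annotated N-terminus
    if start == 1 and protein[0] == "M":
        return None  # only the initiator methionine is missing
    if is_tryptic_start(protein, start):
        return None  # N-terminus explained by trypsin digestion
    candidate = protein[:start]
    if not (min_len <= len(candidate) <= max_len):
        return None  # outside the 18-36 residue window
    if longest_hydrophobic_run(candidate) < min_patch:
        return None  # no hydrophobic patch of at least eight residues
    return candidate


# Toy sequence, made up for illustration (not a real E. coli protein).
toy_protein = "MKKTALALAVLLAGSATAQA" + "ADEKTGYR"
print(putative_signal_peptide(toy_protein, 20))
```

In this toy case the peptide starting at index 20 is preceded by Ala, the upstream sequence is 20 residues long, and it contains a run of nine hydrophobic residues, so it passes all filters.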
As a result of this filtering, the initial list of 488 was trimmed to a set of 96 putative signal peptides (Table S3), which were compared with the data stored in the UniProt, SPdb, and EcoGene databases (Choo et al., 2005; The UniProt Consortium, 2012; Zhou and Rudd, 2013) and with predictions of the SignalP program (Bendtsen et al., 2004). UniProt (http://www.uniprot.org/) annotates signal peptides (as well as protein localization and function) and categorizes them as experimentally proven (by direct protein sequencing), inferred by similarity, or potential (e.g. predicted by both Phobius and SignalP; see http://www.uniprot.org/manual/signal). The SPdb database (http://proline.bic.nus.edu.sg/spdb, last updated in 2009) also provides information on both experimentally confirmed and predicted signal peptides, whereas EcoGene (http://ecogene.org/) provides a detailed manually curated annotation of E. coli genes, including the positions of known or predicted cleavage sites. For E. coli K-12, the current lists of experimentally verified signal peptides in UniProt, SPdb, and EcoGene include 156, 66, and 144 peptides, respectively, for a total of 163. Accordingly, each putative signal peptide identified by proteogenomics was checked to see (i) whether a signal peptide of the same length was reported by the UniProt, SPdb, or EcoGene database or predicted by SignalP, and (ii) whether it exhibited a typical signal peptidase I cleavage pattern, [AGILSV]-X-[AGS] at positions −3, −2, and −1 relative to the cleavage site (Payne et al., 2010).
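The two checks applied to each putative signal peptide can be illustrated as follows. This is a minimal sketch, assuming the signal peptidase I pattern is read as one of [AGILSV] at position −3, any residue at −2, and one of [AGS] at −1; the function names, the toy sequence, and the annotation values are hypothetical and are not taken from any of the databases.

```python
import re

# Pattern over the last three residues of the signal peptide (positions -3, -2, -1).
SPASE_I_MOTIF = re.compile(r"[AGILSV][A-Z][AGS]$")


def has_typical_cleavage_site(signal_peptide: str) -> bool:
    """True if the signal peptide ends with the signal peptidase I pattern."""
    return SPASE_I_MOTIF.search(signal_peptide) is not None


def sources_agreeing(observed_length: int, annotated_lengths: dict) -> list:
    """Names of the sources whose annotated length equals the observed length."""
    return sorted(
        name for name, length in annotated_lengths.items()
        if length == observed_length
    )


# Toy input: made-up sequence and made-up annotation values, for illustration only.
toy_peptide = "MKKTALALAVLLAGSATAQA"
toy_annotations = {"UniProt": 20, "SPdb": None, "EcoGene": 20, "SignalP": 22}

print(has_typical_cleavage_site(toy_peptide))
print(sources_agreeing(len(toy_peptide), toy_annotations))
```

A source that has no entry for the protein (here `None`) never counts as agreeing, so a peptide can be supported by some sources and left unconfirmed by others.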
Efficiency and caveats of the proteogenomic approach
Comparison of the set of 96 putative signal peptides obtained in this work (Table S3) against the 156 experimentally confirmed signal peptides from UniProt showed 58 peptides that were common between these two datasets. Thirty-one more peptides, while not experimentally confirmed until now, were consistent with SignalP and Phobius predictions, as listed in UniProt (of which 24 were also consistent with UniProt annotations, see Table S3). Twenty-two of the 31 respective proteins were localized either in the periplasm or in the outer membrane, for eight proteins the cellular location remained unknown, and just one, potassium transporter KefA, was an integral cytoplasmic membrane protein (in UniProt, membrane-anchored periplasmic proteins AmpH, DacC, NlpA, YfhM, and YgiM were listed as localized in the cell membrane). The functions of these proteins included transport, stress response, and biofilm formation. One more identified peptide had a typical cleavage site but slightly different length than that predicted by SignalP and Phobius. The six remaining peptides came from known periplasmic proteins but differed in length from those listed in UniProt and/or predicted by SignalP and had atypical cleavage sites (Table S3); these peptides still remain to be properly validated. Thus, a single proteogenomic experiment revealed more than one-third (58 out of 156) of all previously validated signal peptides and affirmed several new ones. Ninety peptides out of 96 were consistent with previous experimental data and/or predictions of the SignalP program and/or had typical cleavage sites.
Despite its efficiency in identifying signal peptides, the proteogenomic approach has a number of caveats. First, by discarding tryptic peptides, the filtering procedure necessarily removes those signal peptides that end in Lys or Arg. Fortunately, the number of such experimentally validated signal peptides in bacteria appears to be minuscule. There is an experimentally studied case of cytochrome oxidase subunit 2 (QOX2_BACCR) in Bacillus cereus with the signal peptide apparently ending in a Lys residue (Contreras-Zentella et al., 2003). Other examples of such bacterial signal peptides listed in UniProt (beta-lactamases BLA1_BACCE and BLAC_STRAL, listeriolysin TACY_LISMO, and staphylococcal enterotoxin ETXE_STAAU) do not look credible and probably reflect additional proteolysis of the source proteins prior to Edman degradation. However, there are more than 80 such cases reported in eukaryotes, e.g. yeast asparaginase ASP21_YEAST (Kim et al., 1988). Some of these cases might be genuine; others may reflect sample degradation by endogenous trypsin-like proteases, which would also be relevant for future studies.
Second, the signal peptide mapping procedure could be affected by the action of cellular peptidases other than signal peptidases I and II. In the example of the periplasmic osmotically inducible protein Y (OsmY) shown in Fig. 1, six different peptides starting from E29 supported a 28 aa signal peptide with a perfect cleavage site, which was also consistent with the earlier results of Link and colleagues (1997). However, detection of a single peptide with the cleavage point after M16 (Fig. 1) led to the erroneous conclusion that the N-terminal signal peptide was 16 amino acids long, which was below the 18–36 aa filtering window and therefore resulted in the exclusion of OsmY from the final set (Table S2). In addition to OsmY, the analysed set of 129 N-terminal peptides contained four more instances (DegP, DppA, HdeA, and LivJ) of apparent false-negative assignments, caused by detection of short signal peptides along with those of correct size.
Third, some short(er) N-terminal peptides could arise from tryptic hydrolysis of full-length proteins that had not yet been processed by the signal peptidases. This results in a false-negative assessment similar to what was observed for OsmY. For example, for the well-known (and abundant) outer membrane proteins OmpC and OmpF, many MS/MS-identified peptides supported the canonical signal peptide cleavage, with the mature proteins starting at positions 22 and 23, respectively. However, we also identified other, upstream MS/MS peptides that could only originate from an unprocessed protein. Detection of these peptides was registered as evidence of the absence of N-terminal processing of OmpC and OmpF, again resulting in false-negative assignments.
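The OsmY case can be made concrete with a small sketch. Taking the most N-terminal observed peptide, as in the procedure described earlier, yields a 16-residue candidate that falls outside the 18–36 aa window, whereas counting the peptides that support each start position would point to the 28-residue signal peptide. The second rule is a hypothetical alternative shown for illustration only; it was not used in this work, and all names below are assumptions.

```python
from collections import Counter


def most_n_terminal(starts):
    """Selection rule described in the text: the smallest start position wins."""
    return min(starts)


def best_supported(starts):
    """HYPOTHETICAL alternative: the start seen in the most peptides wins.

    Ties are broken in favour of the more N-terminal position.
    """
    counts = Counter(starts)
    return min(counts, key=lambda s: (-counts[s], s))


def in_length_window(signal_peptide_length, min_len=18, max_len=36):
    return min_len <= signal_peptide_length <= max_len


# 1-based first residues of the OsmY peptides described in the text:
# six peptides starting at E29, one peptide starting right after M16.
osmy_starts = [29] * 6 + [17]

for rule in (most_n_terminal, best_supported):
    start = rule(osmy_starts)
    length = start - 1  # residues preceding the observed N-terminus
    print(rule.__name__, length, in_length_window(length))
```

Whether a support-based rule would improve sensitivity without admitting false positives cannot be judged from this single example.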
Finally, we have observed several instances where the signal peptide cleavage sites reported by MS/MS differed from experimentally determined and/or predicted ones (Table S3). The reasons for these discrepancies are not clear at this time and deserve further investigation.
Evolution of signal peptide prediction
Our next step was to evaluate the fraction of all signal peptides of E. coli detected in the proteogenomics experiment. To our great surprise, we came across vastly different estimates of the total number of signal peptides in the E. coli proteome. The first proteome-wide estimate of the number of signal peptides was given in the seminal paper introducing the SignalP method (Nielsen et al., 1997a). Signal peptides were identified in 330 out of 1680 (20%) of the Haemophilus influenzae proteins, leading the authors to estimate that ∼ 15–20% of any Gram-negative proteome would possess signal peptides. The next release of the SignalP program, SignalP 2.0, came in two variants, which relied on hidden Markov models and on neural networks and predicted 20% and 32%, respectively, of the E. coli proteome to possess signal peptides (Nielsen et al., 1997b; 1999). However, more recent updates of the SignalP program reported substantially lower fractions of E. coli proteins with signal peptides, 14–17% for the two variants of SignalP 3.0 (Bendtsen et al., 2004) and even lower, ∼ 10%, for the more recent version, SignalP 4.0 (Petersen et al., 2011). We have confirmed these estimates by running various versions of SignalP against the current UniProt reference set of 4303 E. coli K-12 proteins (Fig. 3). To a significant degree, this decrease could be attributed to the better discrimination between cleavable and non-cleavable signal peptides by the later versions of the program (Petersen et al., 2011).
Other signal peptide prediction tools gave similarly divergent estimates. PrediSi (Hiller et al., 2004) predicted cleavable signal peptides in 1075 (∼ 25%) E. coli sequences, and Signal-3L (Shen and Chou, 2007) predicted 1479 E. coli proteins (34% of the total proteome) to be secreted. Two other programs, Phobius (Käll et al., 2004) and the later Philius (Reynolds et al., 2008), came up with much smaller numbers, predicting signal peptides in 17–19% of E. coli proteins.
We attempted to provide an independent estimate of the number of signal peptides in E. coli by comparing the proteogenomics results with the lists of predicted signal peptides. As mentioned above, among those signal peptides listed as ‘potential’ in UniProt, 24 predictions have been fully confirmed and for 14 more proteins our data indicated a different cleavage site (Table S3). For two proteins with ‘potential’ signal peptides, the ethanolamine utilization protein EutM and the uncharacterized protein YeeZ, MS/MS detected N-terminal tryptic peptides, MEALGMIETR and MKKVAIVGLGWLGMPLAMSLSAR, respectively, suggesting that they do not have (cleavable) signal peptides. For EutM, indeed, experimental data, including the 3D structure of the full-length protein, confirm that this protein is not processed and stays in the cytoplasm, albeit in a separate carboxysome-like microcompartment (Tanaka et al., 2010). We could only find some tentative data for YeeZ, suggesting that it might act outside the cell (Lynch et al., 2007). Thus, EutM appears to be the only false-positive prediction in the UniProt list; YeeZ could be another one or a false-negative case like OmpC and OmpF discussed above. As a result, the UniProt list of ‘potential’ signal peptides appears to be very close to reality, and the full set of 156 confirmed and 337 predicted signal peptides, accounting for 11.5% of E. coli proteins, can be considered the best available estimate. This means that the proteogenomics experiment reported here detected ∼ 20% of all E. coli signal peptides.
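The two percentages quoted above follow from simple arithmetic on the counts given in the text. The sketch below assumes that the denominator for the proteome fraction is the UniProt reference set of 4303 E. coli K-12 proteins and that the detected fraction refers to the 96 filtered peptides; both readings are assumptions made for this illustration.

```python
# Counts taken from the text.
confirmed = 156        # experimentally confirmed signal peptides in UniProt
predicted = 337        # 'potential' (predicted) signal peptides in UniProt
detected = 96          # putative signal peptides retained after filtering

# ASSUMPTION: the proteome size is the 4303-protein UniProt reference set.
proteome_size = 4303

total_signal_peptides = confirmed + predicted
fraction_of_proteome = total_signal_peptides / proteome_size
fraction_detected = detected / total_signal_peptides

print(total_signal_peptides)
print(f"{fraction_of_proteome:.1%} of the proteome")
print(f"{fraction_detected:.1%} of all signal peptides detected")
```

Under these assumptions the total is 493 signal peptides, which corresponds to about 11.5% of the proteome, and the 96 detected peptides amount to roughly one-fifth of that total.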
Signal peptide prediction for other organisms
The downward trend in the estimates of the number of signal peptides is also seen for other organisms, such as the Gram-positive bacterium Bacillus subtilis (Fig. 3) and the yeast Saccharomyces cerevisiae (data not shown). As already noticed by Leversen and Wiker (2012), for mycobacteria successive versions of SignalP predict fewer and fewer signal peptides, apparently gradually approaching the true number. The improvement achieved by SignalP 4.0 over previous versions of SignalP can primarily be explained by the explicit discrimination between signal peptides and signal anchors, but also by a substantially larger dataset of experimentally known signal peptides available for training (e.g. 2935 Gram-negative instances in SignalP 4.0 vs 334 in SignalP 3.0).
It should be noted that, despite the extremely high accuracy of SignalP predictions for E. coli, the fraction of false-positive and false-negative predictions could be substantially higher for lesser-studied organisms, where few signal peptides have been confirmed by experiment. For example, comparison of SignalP 4.0 predictions with proteogenomics data for the bacteria Caulobacter crescentus and Rhodobacter sphaeroides (Venter et al., 2011) revealed 21 discrepancies out of 429 (5%) and eight out of 354 (2%) respectively (data not shown).
In conclusion, proteogenomics proved to be an efficient high-throughput approach to signal peptide discovery, recovering in a single experiment more than one-third of all experimentally validated signal peptides and ∼ 20% of all signal peptides in E. coli. The criteria for filtering putative signal peptides (absence of C-terminal Lys or Arg, typical length, and the existence of a hydrophobic patch), while purely empirical, allowed removal of all false-positives. Future analyses will show whether these criteria could be modified to improve sensitivity without losing the selectivity of signal peptide recognition. We also note that the fraction of E. coli proteins with signal peptides proved to be on the order of 10%, much lower than previously published estimates, but consistent with the predictions of the most recent version of the SignalP program (Petersen et al., 2011). Similarly low (lower than previously assumed) numbers of signal peptides could also be expected in other microorganisms.