Proteolysis is an important aspect of high-throughput proteomics, since the analysis in these types of experiments is entirely centered on peptides. It is therefore interesting to determine which peptides will be reliably cleaved from a specific protein, since only these peptides have a chance to enter the LC-MS pipeline and be subsequently detected. It has long been assumed that the protease trypsin, the most frequently used cleavage workhorse in proteomics research, cleaves after arginine or lysine residues except when these residues are C-terminally flanked by proline, with the occasional missed cleavage occurring as well. Three distinct attempts to model the enzyme's activity more closely to the physical reality are provided by Siepen et al., the MC:pred software, and the CP-DT tool. In each of these cases, the objective is to improve identification rates in shotgun proteomics, and/or to provide a priori predictions of suitable peptides for quantitative (targeted) proteomics analyses. Predictive models have also been developed for other proteases, but since these focus on the biological role of the proteases, they are not discussed in detail here.
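The classical cleavage rule described above is straightforward to implement. The sketch below performs an in-silico tryptic digest following that rule (cleave after K or R unless the next residue is P), optionally also generating peptides with missed cleavages; the function name and interface are illustrative, not taken from any of the cited tools.

```python
def trypsin_digest(sequence, missed_cleavages=0):
    """In-silico tryptic digest: cleave after K or R, except when the
    next residue is P. Optionally also emit peptides spanning up to
    `missed_cleavages` uncleaved sites."""
    # positions after which the backbone is cut
    sites = [i + 1 for i, aa in enumerate(sequence[:-1])
             if aa in "KR" and sequence[i + 1] != "P"]
    bounds = [0] + sites + [len(sequence)]
    peptides = []
    for n in range(missed_cleavages + 1):
        for i in range(len(bounds) - 1 - n):
            peptides.append(sequence[bounds[i]:bounds[i + 1 + n]])
    return peptides
```

Note that the R in the example sequence below is followed by P and therefore left uncleaved, exactly the exception the rule describes.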
The former two predictors, by Siepen et al. and MC:pred, provide a score for each tryptic site that ranges from 0 to 1, with 1 meaning the site will remain uncleaved. The score is constructed with the intent to remove randomly cleaved peptides from a list of potentially detectable peptides in targeted analyses, or to reduce the identification database to speed up discovery. CP-DT does the inverse, and outputs a list of peptides ranked by the probability that they will occur in the specific cleaved form after tryptic proteolysis. Note that this means that CP-DT can provide probabilities not only for correctly cleaved peptides, but also for peptides containing any number of missed cleavages.
All three predictors use only positional information about the amino acid sequence around the tryptic site as features in the model. Neither chemical information about the molecular structure of the peptide nor any information about the conditions of the proteolytic digest, which can influence digestion efficiency, is used.
When the performance of these algorithms is tested on publicly available data, however, a difference in performance can be seen. This is not only due to the use of different underlying algorithms (an information theory-based strategy by Siepen et al., an SVM by Lawless and Hubbard, and an ensemble of decision trees by Fannes et al.), but also due to differences in the filtering of the training data sets that were used in the learning stage.
We compared the performance of the three models by an ROC curve (Fig. 2), relating the true positive rate to the false positive rate at different thresholds of predicted cleavage probabilities. For each example (i.e., a predicted correctly cleaved tryptic site), the site is considered a negative if it occurs only inside identified peptides as uncleaved, and a positive if it occurs at least once at the C-terminus of an identified peptide in the test data sets. The ensemble classifier outperforms the two other models in both the high-recall and low fall-out regimes in this analysis, offering higher accuracy over all possible thresholds for the two data sets considered.
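The construction of such a curve from labeled sites and predicted probabilities can be sketched in a few lines; the implementation below is a minimal version (it does not group tied scores into a single threshold, as a strict implementation would) and is not the ROCR code used for Fig. 2.

```python
def roc_points(labels, scores):
    """ROC curve from binary labels (1 = site observed as cleaved) and
    predicted cleavage probabilities; the threshold is swept over the
    scores sorted from high to low."""
    pairs = sorted(zip(scores, labels), reverse=True)
    P = sum(labels)
    N = len(labels) - P
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _score, label in pairs:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / N, tp / P))  # (fall-out, recall)
    return points

def auc(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))
```

A perfect predictor yields an AUC of 1.0, a perfectly inverted one 0.0, and a random one about 0.5, matching the interpretation of AUC-ROC given in the Fig. 2 caption.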
Figure 2. The receiver operating characteristic (ROC) curves of the three different predictors; the AUC-ROC gives the probability that the classifier will score a randomly drawn positive sample higher than a randomly drawn negative sample. Curves generated with ROCR.
Retention time prediction is one of the first problems in LC-MS to which statistical modeling was applied, and it remains one of the more difficult issues. Several slightly older reviews are already available for this topic [46, 47], each comparing the "gold standard" SSRCalc algorithm to newer efforts. SSRCalc is based on a simple additive model that calculates retention time as a weighted sum of retention coefficients for the individual residues in a peptide, and then corrects for empirical factors such as peptide length and the tendency to form helical structures. Yet despite this apparent simplicity, these comparisons show that SSRCalc produces fairly accurate predictions and stacks up relatively well (albeit after inclusion of some additional correction factors such as peptide charge [49, 50]) compared to more complex analytical models.
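The additive idea behind SSRCalc can be sketched as follows. The retention coefficients and the length correction below are illustrative stand-ins, not the published SSRCalc values; the real algorithm also corrects for helicity and other empirical factors.

```python
# Illustrative per-residue retention coefficients (hydrophobic residues
# such as W, F, L are retained longer on reversed-phase columns).
# These are NOT the published SSRCalc coefficients.
COEFF = {"A": 1.1, "L": 9.6, "K": -0.2, "G": -0.6, "F": 11.3,
         "S": 0.3, "R": -0.5, "W": 13.1, "E": 1.0, "V": 5.0}

def additive_retention(peptide, length_correction=0.3):
    """Retention estimate as a weighted sum of per-residue coefficients,
    with a crude damping for long peptides standing in for SSRCalc's
    empirical length correction."""
    base = sum(COEFF.get(aa, 0.0) for aa in peptide)
    over = max(0, len(peptide) - 10)  # residues beyond a nominal length
    return base * (1.0 - length_correction * over / len(peptide))
```

The sum structure is the whole model: predicting a peptide is a dictionary lookup per residue, which is what makes SSRCalc so cheap to apply compared to the trained models discussed next.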
Machine learning models have also been used to address the retention time prediction problem, including two artificial neural network-based models by Petritis et al. and Shinoda et al., and two SVM regressors by Klammer et al. and RTPredict. The latter algorithm is based on an oligo kernel and is integrated into OpenMS.
Compared to these machine learning algorithms, SSRCalc has the advantage of simplicity. Since the variability in combinations of LC columns, gradients, solvents, and conditions is so large, even the best model will need to be retrained or recalibrated to the conditions at hand. The more complex models suffer from the disadvantage that, while their predictive power is higher once they are correctly trained, they are slower to train and need significantly more examples (ranging from hundreds to thousands) to reach this power. As a result, the costs of training such an algorithm, that is, running many samples on a specific column, may well outweigh the improvements in predictive accuracy offered by the more complex algorithms.
A more recent review book chapter covers the latest developments in this subdiscipline, and links to a web tool that can be used to evaluate the performance of predictors by generating scatter plots of measured versus predicted retention time. Currently, the web site only offers this functionality for the authors' own predictor, a linear model called rt. In their book chapter, the authors correctly note that earlier reviews show little agreement on a metric to compare retention time predictors, and they therefore propose the standard error between observed and predicted retention times as an appropriate measure. In Moruz et al., however, where the most recent versions of SSRCalc and RTPredict are compared, a different metric is proposed: the minimum time deviation in minutes between observed and predicted retention time for 95% of the peptides, δt95%. We believe this metric to be a more useful comparator, since it provides direct information about the overall performance of the model, giving users an idea of the accuracy of the typical prediction. Note that the deviation can also be given as a percentage of the total run time rather than in minutes.
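The δt95% metric is simple to compute from paired observed and predicted retention times: sort the absolute errors and take the smallest window that contains 95% of them. The sketch below is our own minimal rendering of that definition, not code from Moruz et al.

```python
def delta_t95(observed, predicted):
    """Minimal time window containing the absolute prediction error of
    95% of the peptides (the delta-t-95% metric)."""
    errors = sorted(abs(o - p) for o, p in zip(observed, predicted))
    n = len(errors)
    # index of the ceil(0.95 * n)-th smallest error, in exact integer
    # arithmetic to avoid floating-point rounding at the boundary
    k = -((-95 * n) // 100) - 1
    return errors[k]
```

Because the top 5% of errors are discarded, a single badly predicted outlier does not inflate the metric, which is part of why it reflects the accuracy of the typical prediction.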
Moruz et al. also introduce Elude, a pretrained (and optionally fully retrainable) SVM model for peptide retention time prediction. Elude can be calibrated to the condition at hand, requiring about 200 peptides to achieve maximum accuracy, similar to what SSRCalc needs. It includes 60 features describing the amino acid sequence, including hydrophobicity and helicity, but no information about charge. In the evaluation of the model, an average deviation window of 22.06% of the total length of the chromatographic run is measured for Elude, compared to 25.68% for RTPredict and 27.79% for SSRCalc. For a 90 min run this amounts to an improvement of about 5 min over SSRCalc, a substantial gain that can significantly decrease false positive rates for identifications when used as a means to triage PSMs.
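Converting the percentage-based windows above into absolute times is a one-line calculation; the helper below makes the 90 min example explicit using the figures quoted from the Elude evaluation.

```python
def window_minutes(percent_window, run_minutes):
    """Convert a deviation window given as a percentage of the
    chromatographic run length into minutes."""
    return percent_window / 100.0 * run_minutes

# Deviation windows reported for a hypothetical 90 min run:
elude = window_minutes(22.06, 90)     # Elude
ssrcalc = window_minutes(27.79, 90)   # SSRCalc
improvement = ssrcalc - elude         # roughly 5 minutes
```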
Recently, an improvement to Elude has been published, enabling it to deal with peptides bearing five kinds of PTMs that affect peptide retention time. Changes to the core algorithm of Elude to allow the inclusion of this additional information also led to a reduction in the deviation window for both modified and unmodified peptides, down to about 18%.
For completeness, it is worth mentioning that the Trans-Proteomic Pipeline (TPP) is also able to predict retention times, using an unpublished but open source artificial neural network.
2.3 Peptide detectability
Unfortunately, no single proteomics approach is currently able to reveal the entire population of eluting peptides generated by proteolytic processing of a complex protein mixture. We can define peptide detectability as the probability that a given peptide will be observed in a standard sample analyzed by a standard proteomics routine. Peptide detectability is determined by several factors, among which we can roughly distinguish four classes: (i) the physico-chemical properties of the peptide (mass, hydrophobicity, ability to ionize or fragment); (ii) the limitations of the analytical workflow (including sample preprocessing, MS instruments, and software); (iii) the abundance of the peptide in the sample; and (iv) the other peptides present in the sample that compete with this peptide for subsequent detection or identification. Taken together, these factors determine whether a peptide will be detected in a particular experiment, and can also lead to variations in the detectability of peptides across experiments.
In order to predict peptide detectability, several supervised classification techniques have been applied, each attempting to model all the above factors at once. Tang et al. used ensembles of 30 one-hidden-layer feed-forward neural networks, reporting a single accuracy estimate. They initially identified 175 features, after which unpromising features were removed using a t-test filter. Additionally, the number of correlated features was reduced using principal component analysis, retaining 95% of the variance. This work was later extended to include an estimation of protein abundance by using an iterative quantity adjustment.
Another approach was proposed by Lu et al. [66-68] with the APEX (Absolute Protein Expression) tool. Their peptide detectability classifier was constructed from a random forest, using feature vectors consisting of 35 to 66 features. Here again, peptide detectability prediction was used as an intermediate step in the overall task of protein quantification: after prediction of peptide detectability, protein quantification was performed by normalizing the observed spectral count by the predicted count.
The approach by Mallick et al. predicts whether or not peptides are proteotypic. They provided several distinct predictors, each specific to a particular experimental design. Mallick et al. identified 494 amino acid features, which were subsequently summed and averaged for each peptide, resulting in almost 1000 distinct features. Next, based on the Kullback–Leibler distance and the Kolmogorov–Smirnov distance, the smallest descriptive subset of properties was determined, and from this subset a Gaussian mixture discriminant function was developed on the training data. This application is named PeptideSieve.
Similar to the approach of Tang et al., Sanders et al. employ a neural network as a binary classifier, called PepFly. However, they do not provide a single general classifier; instead, they note that the classifier should be retrained for each distinct experimental setup. Sanders et al. identified 596 features, from which a reduced set is selected using a greedy search through feature space, after which the selected features are used to construct a one-hidden-layer feed-forward neural network.
Wedge et al., on the other hand, use a genetic programming classifier. They selected 393 peptide properties for their initial genetic program. Afterwards, based on the usage of input nodes in this initial program, the set of input nodes was reduced to 34 and subsequently to 6.
ESPPredictor by Fusaro et al. is based on a random forest to predict high-responding peptides. They identified 550 physico-chemical properties as features. In contrast to the other approaches, however, no feature selection is performed here. Instead, all features are used to construct a random forest consisting of 50 000 trees. The output is a probabilistic value, corresponding to the fraction of trees that predict the peptide to be detectable.
Next, the predictor by Webb-Robertson et al., named STEPP (SVM Technique for Evaluating Proteotypic Peptides), uses an SVM with a quadratic kernel. Feature selection over the 35 initial features is performed based on the Fisher criterion score.
Finally, Eyers et al. combined several of the above-mentioned types of classifiers in order to create a consensus algorithm named CONSeQuence. Concretely, a random forest, a genetic program, an artificial neural network, and an SVM are each used individually. Afterwards, a specific number of votes from the four predictors is required in order to classify a peptide as detectable or not.
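The consensus step itself is a simple vote count, sketched below. The four rule-of-thumb "classifiers" here are hypothetical stand-ins for the trained random forest, genetic program, ANN, and SVM; the vote threshold is likewise an illustrative choice, not the value used by CONSeQuence.

```python
# Toy stand-ins for the four classifier families combined in a
# CONSeQuence-style consensus; each is a hypothetical rule of thumb,
# not the published model.
classifiers = {
    "rf":  lambda pep: len(pep) >= 7,
    "gp":  lambda pep: pep.count("K") + pep.count("R") <= 2,
    "ann": lambda pep: "W" not in pep,
    "svm": lambda pep: 6 <= len(pep) <= 25,
}

def consensus_detectable(peptide, required=3):
    """Call a peptide detectable when at least `required` of the four
    individual predictors vote positive."""
    return sum(clf(peptide) for clf in classifiers.values()) >= required
```

Raising `required` trades recall for precision, which is exactly the knob a consensus predictor exposes to its users.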
As shown, several algorithms have thus been used to address the peptide detectability problem, all of them relying on a different range of features. However, appropriate training of the classifier plays a crucial role in obtaining good performance. In particular, a classifier trained on a data set pertaining to a specific experimental setup will only perform optimally for that same experimental setup. Indeed, Mueller et al. showed that the PeptideSieve predictor by Mallick et al. was less accurate when applied to a different data set. In addition to training the model on an appropriate data set, a careful selection of the training data is also required, that is, obtaining a balanced set of positive and negative examples.
We compared their performance in predicting the actually detected peptides out of a list of peptides generated by an in silico cleavage of the data set according to the Keil rules. That way we are able to compare classifier software (Peptide Detectability Predictor, PeptideSieve, and CONSeQuence) that takes proteins as input and performs the cleavage itself. Furthermore, to ensure a fair comparison, no uncleaved sites were allowed during peptide generation; unlike all the other predictors, Peptide Detectability Predictor does not offer the possibility to set the number of uncleaved sites. Several potential peptide candidates are thus not included in the analysis; however, the mutual differences obtained between the various predictors should remain valid. In addition to the previous parameters, the minimal peptide sequence length was set to five, PeptideSieve was run using the PAGE_ESI experimental design choice, and the rank score prediction type was used for CONSeQuence.
An example peptide is considered positive if the predicted peptide was actually observed in the data set. Similarly, an example is considered negative if it was not observed in the data set. The ROC analysis in Fig. 3 shows that the predictors Peptide Detectability Predictor, PeptideSieve, and CONSeQuence perform very similarly over the two data sets, while ESPPredictor seems to be the best choice when low fall-out is desired, and STEPP appears to be the best choice when high recall is desired.
Figure 3. Comparison of peptide detectability classifiers based on an ROC analysis. For the iPRG data set, the AUCs for ESPPredictor and STEPP are similar: ESPPredictor performs better at low false positive rates, while STEPP is better in the high-recall regime. ESPPredictor is grayed out for the CPTAC data set because the predictor was trained on data from that same study.
2.4 Peptide identification
To match MS/MS spectra to the peptides that generated them, database search methods, known as search engines, constitute by far the most popular approaches. The databases that are searched in these approaches can either contain the peptide sequences expected to be observed in the sample, or (processed) MS/MS spectra that have been observed and identified in previous experiments. In both cases a measure of similarity is required that scores candidate peptides against an experimental MS/MS spectrum. However, the best-scoring PSM can still be wrong, and the challenge is to separate the correct PSMs from the incorrect ones, which is a typical machine learning task. A first attempt to find a good separation between correct and incorrect PSMs is to cluster the matching score distribution with a probabilistic mixture modeling method. This clustering approach was further improved by representing PSMs as complex feature vectors that contain more information about the proteomics experiment, such as the expected number of matches from a given database, the effective database size, a correction for indistinguishable peptides, or a measurement of match quality [80-84]. This algorithm is integrated in the TPP under the name PeptideProphet.
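The mixture modeling idea can be illustrated with a small EM fit on one-dimensional PSM scores. Note that PeptideProphet's actual model combines a Gaussian for correct matches with a non-Gaussian (e.g., gamma) density for incorrect ones; the two-Gaussian simplification below is only a sketch of the clustering principle.

```python
import math
import random

def em_two_gaussians(scores, iters=200):
    """Fit a two-component 1D Gaussian mixture to PSM scores by EM,
    separating the low-scoring (incorrect) from the high-scoring
    (correct) population without any labels."""
    mu = [min(scores), max(scores)]  # crude but sufficient initialization
    var = [1.0, 1.0]
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: posterior responsibility of each component per score
        resp = []
        for x in scores:
            p = [w[k] / math.sqrt(2 * math.pi * var[k])
                 * math.exp(-(x - mu[k]) ** 2 / (2 * var[k])) for k in (0, 1)]
            s = p[0] + p[1]
            resp.append((p[0] / s, p[1] / s))
        # M-step: re-estimate weights, means, and variances
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(scores)
            mu[k] = sum(r[k] * x for r, x in zip(resp, scores)) / nk
            var[k] = max(1e-6, sum(r[k] * (x - mu[k]) ** 2
                                   for r, x in zip(resp, scores)) / nk)
    return mu, var, w
```

The fitted posteriors directly give the probability that a PSM with a given score belongs to the correct-match component, which is the quantity PeptideProphet reports.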
To solve the match separation challenge as a supervised learning task, one requires examples of both correct and incorrect PSMs. Such a data set can be obtained from purified proteome samples. In this case the expected peptide identifications are considered true matches, whereas unexpected identifications are considered false matches. These rather limited data sets [79, 85] allowed for supervised methods to compute accurate classifiers (as compared to the clustering approach) using SVMs [86-88], random forests, and neural networks.
Examples of correct and incorrect PSMs can also be obtained from the actual proteomics experiment itself. By searching MS/MS spectra against both a target and a decoy database, one obtains examples of incorrect PSMs from the decoy database, while matches against the target database constitute both correct and incorrect matches. A semisupervised SVM has been shown to learn accurate PSM classifiers from these data using an iterative learning procedure: initially, a small set of high-scoring target PSMs is identified, and the SVM learns to separate these from the decoy PSMs. The learned classifier is applied to rescore the entire set, and if new high-scoring PSMs are identified, the procedure is repeated; this algorithm is called Percolator [9, 91, 92]. By using a special loss function, this task can be solved even more accurately with a supervised learning algorithm: because the loss function does not severely penalize examples that are far from the decision boundary, accurate nonlinear SVM classifiers can be learned even though incorrect target PSMs labeled as correct are present in the data set.
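The iterative target-decoy rescoring loop can be sketched as follows. A trivial centroid-based linear scorer stands in for Percolator's linear SVM, and the feature tuples, the top fraction used to select positives, and the fixed round count are all illustrative simplifications (Percolator selects positives by a q-value threshold and uses cross-validation).

```python
def percolator_style(psms, rounds=3, top_frac=0.25):
    """Semi-supervised rescoring sketch in the spirit of Percolator.
    Each PSM is (feature_vector, is_decoy, search_score). The 'trained
    classifier' is the direction between the centroid of the current
    positives and the centroid of the decoys, standing in for an SVM."""
    targets = [p for p in psms if not p[1]]
    decoys = [p for p in psms if p[1]]
    # initial positives: the best-scoring targets by the raw search score
    targets.sort(key=lambda p: p[2], reverse=True)
    positives = targets[: max(1, int(top_frac * len(targets)))]
    dim = len(psms[0][0])
    weights = [0.0] * dim
    for _ in range(rounds):
        cp = [sum(p[0][d] for p in positives) / len(positives) for d in range(dim)]
        cn = [sum(p[0][d] for p in decoys) / len(decoys) for d in range(dim)]
        weights = [a - b for a, b in zip(cp, cn)]
        # rescore all targets with the learned weights, pick new positives
        targets.sort(key=lambda p: sum(w * f for w, f in zip(weights, p[0])),
                     reverse=True)
        positives = targets[: max(1, int(top_frac * len(targets)))]
    return weights
```

The key property the sketch preserves is that the decoys supply guaranteed negatives, so the loop can bootstrap a discriminant without any externally labeled correct PSMs.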
A similar approach, in which PSMs are represented by feature vectors, can be used to combine the results of different search engines to obtain more accurate PSM scores. The best results are obtained by computing complex features from the observations made by all search engines combined, and applying linear discriminant analysis (iProphet) or random forest learning (PepArML) in a similar semisupervised way as described above.
2.5 Peptide fragmentation
The fragmentation of a peptide using methods like CID produces signal spectra that contain information about the chemical dissociation pattern of the fragmented peptide. The signal peaks in an MS/MS spectrum indicate the presence of a peptide fragment ion with a specific mass, and the intensity of such a peak depends on a number of factors, such as the abundance of the peptide in the sample and the efficiency of the bond breaking that generated the fragment. Also at play are the proteotypicity of the fragment ion, as well as factors related to the peptide and the instrument that generated the MS/MS spectrum. Being able to predict signal peak intensities is important for understanding the patterns behind peptide fragmentation.
Elias et al. implemented an inductive Bayesian decision tree approach to model peptide fragmentation, and showed that a decision tree representation is highly suitable for learning the diverse set of rules that govern peptide fragmentation. Their data-driven approach was able to extract many of the known fragmentation rules from 27 000 PSMs, and discovered several new ones. However, their approach does not model the peak intensities directly; rather, it models the probability of observing a certain fragment ion intensity bin. A similar study based on Bayesian neural networks was presented in Ref. with a data set of 13 900 PSMs.
Approaches that model peak intensities directly exist as well. PeptideART implements an ensemble of neural networks that each model the most important fragment ion peak intensities in a multioutput, feed-forward, one-hidden-layer neural network. The features used as input to the neural network are very similar to those suggested by Elias et al. The authors reported a systematic assessment of the accuracy of current peptide MS/MS spectrum predictors for the most commonly used CID instruments, and found that PeptideART, when trained on a data set of 41 054 PSMs, achieves generally higher accuracy on a wide range of proteomic data sets.
Another promising approach is to predict the intensity ranks of the fragment ions in an MS/MS spectrum. Frank et al. implemented a discriminative ranking-based model that applies boosting to model the relationship between simple sequence-based features and the observed peak intensity ranks.
A recent approach predicts an entire fragmentation spectrum at once, based on a weighted k-nearest neighbors algorithm applied to a spectral library. The distance function used by the nearest neighbor method is learned with an SVM, and the method is used to complete a spectral library and improve the results of search engines.
2.6 Protein inference
In a peptide-centric proteomic approach, proteins are identified based on peptide homology. As noted above, proteins are not analyzed directly by MS, but by partial analysis of fragment ions of their peptides, which are obtained after proteolysis and separation via LC. The connectivity between proteins and peptides is thus lost in various ways during these steps in the overall analytical process. Protein inference is then the reassembly process that identifies and characterizes (including PTMs) the proteins in the sample based on the observed evidence, that is, the identified peptides. The challenge in protein inference is to correctly identify these precursor protein sequences, since a limited set of peptides may be assigned to multiple proteins. The protein inference problem is well described in the literature [10, 104-106].
It is worth mentioning that the main effort in protein inference is not related to spectral matching as described in the section on peptide identification, but to the induction of protein-to-peptide connectivity after peptide identification. Therefore, mostly information about peptide identifications and protein accession numbers is used to solve the inference problem, as depicted by the emphasized bipartite graph in Fig. 4. A trivial solution to this problem would be to report the most complete set of proteins that corresponds to the observed peptides. In contrast, a more reductionist approach is to adopt the principle of Occam's razor and to report the minimal set of proteins that could explain the observed peptides. However, neither of these solutions is likely to entirely reflect the underlying natural process, and more evidence can probably be incorporated in the computational approach toward more directed protein inference. For example, auxiliary and derivative information from the PSMs (e.g., delta mass, missed cleavages, charge, or competing peptides for the same spectrum) could lend support to a particular protein, while information from the spectral data layer is often ignored in current protein inference algorithms. For completeness, we graphically illustrate the inference problem in Fig. 4 with its available meta-information. It is natural to represent the inference problem as a probabilistic relational model [18, 107]. From the figure it is clear that there are three layers of information, each representing a step in the inference process. The nodes in the layers represent: (i) spectral data: m/z, RT, intensity, total ion count, etc.; (ii) peptide sequence annotation: mass, calculated pI, predicted fragment ions, uncleaved sites, etc.; and (iii) protein information: mass, number of peptides, sequence length, PTMs, protein group, etc.
Figure 4. A bipartite graph representing the three layers of information in the protein inference process. Layer I being the spectral data layer, layer II the peptide sequence annotation layer (PSM), and layer III containing protein information. The edges between the layers contain the information that can be used for protein inference; examples are given under the edges.
The edges between the layers also contain information that can be used for protein inference: edge 1–2, PSM: number of fragment ions matched, PSM score, p-value, e-value, rank, etc.; and edge 2–3, peptide–protein connection: parsimony, proteotypic peptides, peptide observability, intensity-abundance relation, etc.
To determine a minimum set of peptides that can be uniquely assigned to proteins, the one- or two-peptide rules, applied to only a small set of highly confident first-ranked peptides, are often employed. These simple rules however only allow identification of a very small portion of the proteins in a sample. Most modern inference methods try to incorporate more of the available information in combination with a few heuristic rules; sensu stricto, these heuristic methods cannot be classified as machine learning methods [109-117]. We review a set of implementations rooted in machine learning below.
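The Occam's-razor strategy mentioned above amounts to a minimal set cover over the bipartite peptide-protein graph. Exact minimal cover is NP-hard in general, so a common sketch is the greedy approximation below; this is an illustration of the parsimony principle, not the algorithm of any specific cited tool.

```python
def minimal_protein_set(protein_to_peptides):
    """Greedy parsimony inference: repeatedly pick the protein that
    explains the most still-unexplained identified peptides, a standard
    approximation to the (NP-hard) minimal set cover problem."""
    uncovered = set().union(*protein_to_peptides.values())
    chosen = []
    while uncovered:
        best = max(protein_to_peptides,
                   key=lambda prot: len(protein_to_peptides[prot] & uncovered))
        gained = protein_to_peptides[best] & uncovered
        if not gained:
            break
        chosen.append(best)
        uncovered -= gained
    return chosen
```

Note how a protein whose peptides are all shared with an already chosen protein (a subsumed protein) is never reported, which is exactly the reductionist behavior, and its limitation, discussed above.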
Usually, methods inferring a minimal set of proteins that covers the most likely and confidently identifiable peptides are restricted to the bipartite graph [118-120] as depicted in Fig. 4, but Spivak et al. recently proposed a single optimization problem that also incorporates information from the spectral layer. The approach was motivated by the observation that the peptide- and protein-level tasks are cooperative, and the solution to each can be improved by using information about the solution to the other. This relation is illustrated by a feedback edge that connects the protein layer with the spectral layer in Fig. 4. The approach is implemented in the Barista tool of the Crux toolkit and relies on an artificial neural network to solve the optimization problem.
Other methods, borrowed from language modeling, can be used for the inference problem as well. Yang et al. apply a vector space model, used in information retrieval to model document identifiers, to model peptides as protein identifiers.
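The analogy can be made concrete: proteins play the role of documents, identified peptides the role of query terms, and an inverse-document-frequency weight naturally down-weights degenerate peptides shared by many proteins. The TF-IDF/cosine formulation and the smoothing constants below are a generic information-retrieval sketch, not the specific model of Yang et al.

```python
import math

def rank_proteins(protein_peptides, observed):
    """Vector-space sketch of inference: proteins as documents over a
    peptide vocabulary, ranked by cosine similarity to the set of
    observed (identified) peptides."""
    n = len(protein_peptides)
    df = {}
    for peps in protein_peptides.values():
        for pep in peps:
            df[pep] = df.get(pep, 0) + 1
    # IDF down-weights peptides shared by many proteins (smoothed)
    idf = {pep: math.log((n + 1) / (d + 0.5)) for pep, d in df.items()}
    qvec = {pep: idf.get(pep, 0.0) for pep in observed}
    qnorm = math.sqrt(sum(q * q for q in qvec.values()))
    scores = {}
    for prot, peps in protein_peptides.items():
        vec = {pep: idf[pep] for pep in peps}
        dot = sum(vec.get(pep, 0.0) * q for pep, q in qvec.items())
        norm = math.sqrt(sum(v * v for v in vec.values())) * qnorm or 1.0
        scores[prot] = dot / norm
    return sorted(scores, key=scores.get, reverse=True)
```

A protein supported only by widely shared peptides thus scores lower than one supported by a unique peptide, mirroring the role of proteotypic peptides in the edge 2-3 information of Fig. 4.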
An interesting development is the use of peptide detectability in the inference problem [64, 65, 124], for which the above section covers the available algorithms. In this context, the graph representation of the protein inference problem can be modified by adding probabilities to a peptide–protein connection. Doing so allows information about detectable but unidentified peptides to be included in the inference.
Another development is the use of information about peptide intensity and protein abundance. This approach was first described to solve the shared peptide problem in protein quantification; however, it allows for pruning the degenerate peptides as well. In this sense, the protein inference problem can be considered to be generalized by the protein quantification problem [97, 125, 126].
The plethora of proposed solutions for protein inference serves as a good illustration of the magnitude of the problem. Validation of protein inference and quantification is, however, still a topic of ongoing research.