DDASSQ: An open‐source, multiple peptide sequencing strategy for label free quantification based on an OpenMS pipeline in the KNIME analytics platform

Abstract In this study we investigated the performance of a computational pipeline for protein identification and label free quantification (LFQ) of LC–MS/MS data sets from experimental animal tissue samples, as well as the impact of its specific combinatorial peptide search approach. The full pipeline workflow was composed of peptide search engine adapters based on different identification algorithms, within the frame of the open‐source OpenMS software running in the KNIME analytics platform. Two in silico tryptic digestion, database‐search assisted approaches (X!Tandem and MS‐GF+), de novo peptide sequencing (Novor), and consensus library search (SpectraST) were tested for the processing of LC‐MS/MS raw data files obtained from proteomic LC‐MS experiments performed on proteolytic extracts of mouse ex vivo liver samples. The results of the proteomic LFQ were compared to those based on the application of the two software tools MaxQuant and Proteome Discoverer for protein inference and label‐free data analysis in shotgun proteomics. Data are available via ProteomeXchange with identifier PXD025097.

well as their relative or absolute amounts through different computational approaches.
To enhance quantitative MS accuracy, methods based on sophisticated experimental designs, such as stable isotope labeling by amino acids in cell culture (SILAC) [4] and chemical labeling methods including tandem mass tags (TMT), isobaric tags for relative and absolute quantification (iTRAQ), and dimethyl labeling, have been introduced [5].
However, due to the additional time needed for sample processing, coupled with the elevated costs of these procedures, the label free quantification (LFQ) strategy remains a prominent option for proteomics-based studies [6].
In this context, the application of combined multiple engines presents technical and computational challenges, including their heterogeneity in terms of scoring for identification quality control, the propagation of false discoveries, as well as conspicuous informatics challenges related to the different data formats employed by each software. To tackle these hindrances, integration tools like iProphet and Scaffold have been developed [16,17].
In this context, Vaudel et al. reported SearchGUI, an open-source graphical user interface that allows users to configure and run the freely available search engines OMSSA and X!Tandem [18], and PeptideShaker, a platform for the interpretation of results from multiple search engines (X!Tandem, MS-GF+, MS Amanda, OMSSA, MyriMatch, Comet, Tide, Mascot, Andromeda, MetaMorpheus) and de novo engines (Novor, DirecTag and mzIdentML) [19]. Kwon et al. published MSBlender, a statistical method for integrative analysis based on the conversion of raw search scores from different database-assisted search engines (InsPecT, MyriMatch, SEQUEST and X!Tandem) into a probability score for every possible PSM, thus accounting for correlation between search scores and estimating false discovery rates, and leading to more PSM identifications than any single search engine at the same false discovery rate [20].
The authors showed that increased identifications improved spectral counts for most proteins and allowed the quantification of proteins that would not have been quantified by individual search engines. Of note, they also demonstrated that enhanced quantification contributes to improved sensitivity in protein differential expression analyses [20].
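The false discovery rate control mentioned above is typically based on target-decoy competition. As a minimal illustrative sketch (the function and field names are invented for illustration and are not taken from MSBlender or any of the cited tools), the q-value of each PSM in a score-ranked list can be estimated as follows:

```python
# Hypothetical sketch of target-decoy FDR estimation for a ranked PSM list.
# 'is_decoy' flags hits against reversed (decoy) sequences; all names here
# are illustrative, not taken from any of the cited tools.

def q_values(psms):
    """Assign a q-value to each PSM, assuming psms is sorted best-score-first.

    FDR at rank i is estimated as (#decoys so far) / (#targets so far);
    the q-value is the minimum FDR at which the PSM would still be accepted.
    """
    fdrs = []
    targets = decoys = 0
    for psm in psms:
        if psm["is_decoy"]:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))
    # enforce monotonicity from the bottom of the ranked list upward
    qvals = fdrs[:]
    for i in range(len(qvals) - 2, -1, -1):
        qvals[i] = min(qvals[i], qvals[i + 1])
    return qvals

ranked = [
    {"peptide": "LVNELTEFAK", "is_decoy": False},
    {"peptide": "KAFETLENVL", "is_decoy": False},
    {"peptide": "QWERTYK",    "is_decoy": True},
    {"peptide": "SAMPLEPEPK", "is_decoy": False},
]
print(q_values(ranked))  # -> [0.0, 0.0, 0.3333333333333333, 0.3333333333333333]
```

Combining engines changes the composition of this ranked list (more confident targets per decoy), which is why the combined approaches cited above gain identifications at a fixed FDR.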
On a similar line, Zhao et al. reported an efficient identification strategy based on the application of multiple peptide search engines, highlighting the similarity between their proteomic results and those of highly accurate RNA-seq quantifications [21]. Audain et al. reported a bioinformatics solution based on the KNIME/OpenMS platform to compare the performance of protein inference procedures such as PIA, ProteinProphet, Fido, ProteinLP, and MSBayesPro using the three database search engines Mascot, X!Tandem, and MS-GF+ [22].

Statement of significance
The identification of peptide sequences for protein quantification represents one of the crucial steps in development of shotgun proteomics experiments. Here, we describe the general impact of combining multiple peptide search engines working on different theoretical and applicative principles, on the protein identification and quantitation performance of a pipeline built in an opensource proteomic platform.
The results are compared with those generated by the two well-established proteomic software tools Proteome Discoverer™ and MaxQuant®.

Taking a conceptual step forward along the same line, Mohammed and Palmblad recently developed a theoretical framework and an automated data processing workflow including different peptide identification methods, based on the bioinformatic platform Taverna [23]. In that study, the scoring results generated by sequence database search (X!Tandem) were compared and combined with those from spectral library search (SpectraST) and de novo sequencing (PepNovo) algorithms, helping to discriminate correct from incorrect peptide identifications.
The aim of this study was to evaluate the protein quantification performance of a proteomic pipeline for LFQ analysis based on the concept of combining multiple peptide search engines that work on different theoretical and applicative principles. The sequential combination of de novo peptide sequencing (Novor algorithm), two in silico tryptic digestion, database-search assisted engines (X!Tandem and MS-GF+), and consensus library search-based peptide identification (SpectraST) was tested through their adapter node versions in the open-source OpenMS software available in the analytics platform KNIME (Konstanz Information Miner) [24,25]. We will refer to the workflow based on this approach as DDASSQ (De novo, Database Assisted, Spectral Search and Quantification).
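The combination idea underlying DDASSQ can be sketched in a few lines: for each MS/MS spectrum, pool the peptides proposed by the different engines and keep the majority call. This is a deliberately simplified toy version of consensus scoring (engine names aside, none of it reproduces the actual OpenMS ConsensusID algorithm):

```python
from collections import Counter

# Illustrative sketch of the multiple-engine combination idea: for each MS/MS
# spectrum, pool the peptides proposed by the engines and keep the majority
# call. This mimics the spirit of consensus scoring, not the actual OpenMS
# ConsensusID implementation; spectrum IDs and peptides are invented.

def consensus(engine_results):
    """engine_results: {engine_name: {spectrum_id: peptide}}."""
    by_spectrum = {}
    for hits in engine_results.values():
        for spectrum, peptide in hits.items():
            by_spectrum.setdefault(spectrum, []).append(peptide)
    return {s: Counter(p).most_common(1)[0][0] for s, p in by_spectrum.items()}

results = {
    "XTandem":   {"scan_1": "LVNELTEFAK", "scan_2": "AEFVEVTK"},
    "MSGFPlus":  {"scan_1": "LVNELTEFAK", "scan_3": "GITWK"},
    "Novor":     {"scan_1": "LVNELTEFAK", "scan_2": "AEFVEVTK"},
    "SpectraST": {"scan_2": "AEFVEVTK"},
}
print(consensus(results))
```

Note how `scan_3`, seen by only one engine, still survives: the combination both reinforces agreeing identifications and extends coverage to spectra matched by a single engine.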
Seeking further insight into the behavior of proteomic workflows in generating LFQ results, we first tested the performance of search engine combinations and evaluated the quantitative results. Then, the corresponding protein LFQ computed on different proteomic datasets was benchmarked and compared with that obtained using two extensively tested and popular software tools, MaxQuant (MQ) [9] and Proteome Discoverer (PD).

FIGURE 1 Layout of the tested multiple search engine, OpenMS-based proteomic pipeline

DDASSQ accuracy: Spike-in protein datasets
The general structure of the DDASSQ workflow, in which LFQ is achieved by applying the four peptide search engines X!Tandem, MS-GF+, Novor and SpectraST, is shown in Figure 1.
The precision and accuracy of the DDASSQ workflow were tested on the spike-in datasets (Figure 2D). This effect was more evident in D2, where only the difference between the lowest tested concentrations was non-significant (Figure 2I). Accordingly, the R-values corresponding to the pairwise variation ratios moved partially toward negative values (Figure 2E). The analysis evidenced a similar level of accuracy for the two tools (Figure S2A and Figure S2D), with better performance of PD in the lower spike-in amount range (dataset D2, Figure S2A-C) and higher overall sensitivity of DDASSQ in the higher spike-in amount range (dataset D1, linear regression slope value: 0.5601, Figure S2E) compared to PD (linear regression slope value: 0.3776, Figure S2F).

Characteristics of in-house input files
The LC-MS chromatographic profiles from duplicate analyses of the proteolytic peptides obtained from fractions F1 and F2 are reported in Figures S3-S6. The intensities of chromatographic peaks falling across almost the entire retention time window indicated that the fractionation process led to the recovery of lower amounts of peptides in F2 compared to fraction F1. Under these conditions, it was reasonable to expect higher differential LFQ values in F1 than in F2.

Proteomic tools performance: General outcomes
The collective results for the total numbers of proteins and peptides identified and selected for LFQ by DDASSQ, PD and MQ are reported in Table 1.

Impact of search engines combination on protein selection for LFQ
To better understand the contribution of each search engine (i.e., of each peptide search criterion/approach) to the overall DDASSQ pipeline performance, the workflow was modified by sequential exclusion of the peptide search nodes, according to the results layout reported in Table 2. The introduction of SpectraST in the pipeline was responsible for 43.7% of the protein identifications reported in the LFQ list (Table 2). The increase in the overall number of protein identifications was paralleled by a significant increase in the corresponding total estimated intensities in both fractions F1 and F2, with total intensities in fraction F1 higher than those computed for fraction F2 (Table 2).
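The sequential-exclusion bookkeeping behind Table 2 amounts to simple set arithmetic over the LFQ protein lists. A toy example (the protein accessions and the resulting percentage are invented; the figure reported in the text is 43.7%):

```python
# Toy illustration of sequential-exclusion bookkeeping: the share of final
# LFQ proteins that appear only when a given engine (here SpectraST) is part
# of the pipeline. Protein sets are invented for illustration.

with_spectrast    = {"P1", "P2", "P3", "P4", "P5", "P6", "P7", "P8"}
without_spectrast = {"P1", "P2", "P4", "P7"}  # same pipeline minus SpectraST

gained = with_spectrast - without_spectrast
share = 100 * len(gained) / len(with_spectrast)
print(f"SpectraST-dependent identifications: {share:.1f}%")  # -> 50.0%
```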
Taken all together, these results confirm the capacity of the combined peptide search strategies (de novo peptide sequencing, database-assisted search and spectral searching) to yield higher numbers of identified peptides as well as improved identifications, which ultimately should also lead to significant improvements in the LFQ-generated protein quantitative data.

DDASSQ/PD/MQ LFQ correlation results
The concordance of the protein LFQ values computed by the three proteomic tools DDASSQ, PD and MQ (n = 1294 shared proteins) is visualized in Figure 6A and Figure 6B.
The log-transformed data showed significant correlations between the DDASSQ LFQ-Δ values and those from PD (R = 0.948, p < 0.001, Pearson product-moment, two-sided, Figure 6C) and MQ (R = 0.907, p < 0.001, Pearson product-moment, two-sided, Figure 6D), respectively (see Table 1 for details). Of note, and above all, the best-fit linear curves for the PD/DDASSQ data showed an intercept value close to zero (Figure 6D), while those for MQ/DDASSQ (Figure 6E) and MQ/PD (Figure 6F) showed negative intercepts, suggesting that in both cases the OpenMS and PD proteomic workflows produce values about 2 LFQ log2 units higher than those generated by MQ. According to these results, proteins in this range with a positive log2-fold value (i.e., an up-regulation) found by DDASSQ and PD will be assigned a negative value (down-regulation) by MQ.
The origin of this apparent systematic error remains to be established.
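The sign-flip effect of such a constant offset can be reproduced with a minimal least-squares sketch (all LFQ values below are invented for illustration): two tools reporting identical log2 ratios shifted by -2 units yield a unit slope, a -2 intercept, and opposite regulation calls for most proteins.

```python
# Minimal least-squares sketch (invented values) showing how a constant
# offset between two tools' log2 LFQ ratios flips the apparent sign of
# regulation: identical fold changes shifted by -2 log2 units.

def linfit(x, y):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / \
            sum((a - mx) ** 2 for a in x)
    return slope, my - slope * mx

log2_pd = [1.0, 1.5, 0.5, 2.0, 1.2]       # hypothetical PD log2 ratios
log2_mq = [v - 2.0 for v in log2_pd]      # same ratios, offset by -2 units

slope, intercept = linfit(log2_pd, log2_mq)
print(round(slope, 6), round(intercept, 6))   # -> 1.0 -2.0

# proteins called "up" by PD but "down" by MQ
flipped = sum(1 for a, b in zip(log2_pd, log2_mq) if a > 0 > b)
print(flipped)
```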

DISCUSSION
In the present study, the performance of an LFQ proteomic workflow based on the combination of three different peptide identification approaches was evaluated on two previously reported LC-MS proteomic datasets, as well as on an in-house dataset obtained from mouse liver protein extracts.
The proposed workflow was built using the OpenMS/KNIME adapters of the peptide search engines Novor, X!Tandem, MS-GF+ and SpectraST, all working through their specific nodes developed in the KNIME platform [24,25].
Recent studies have reported the impact of combining several database-assisted peptide search tools, working independently within online or installation-based computational platforms, on the identification of different peptide sequences. This approach increased peptide identifications and protein amino acid sequence coverage, thus providing a relatively simple but efficient way to maximize the utilization of mass spectra [18-23].
From the quantitative point of view, the reported results support the concept that the improvement obtained by applying a multiple search engine strategy translates into more accurate protein quantification, taking advantage of the higher number of proteins identified, with a performance similar to that of highly accurate RNA-seq approaches [21].
Based on these aspects, we aimed to test a composite proteomic workflow according to the hypothesis that its overall identification and quantification capacity at the proteome level can be improved by the combination of multiple peptide search tools based on radically different theoretical and informatic backgrounds, in line with the hypothesis proposed by Mohammed and Palmblad [23].
One of the goals was to design a flexible, user-friendly computational system allowing the management of several parameters involved in proteomic pipeline nodes without requiring deep knowledge of their underlying informatic grounds. From this point of view, the OpenMS tools built in the KNIME platform seemed to be an ideal starting point.
Therefore, among those available in the OpenMS/KNIME platform, we first selected the adapter of Novor, a commercial software package implementing an algorithm for de novo peptide sequencing: that is, peptide sequences are deduced directly from MS/MS data without requiring a reference sequence database [8].
The de novo peptide sequencing was combined with two database-assisted search algorithms (X!Tandem and MS-GF+). X!Tandem, as reported in its original version by Craig and Beavis [11], searches peptide structures starting from tandem MS/MS spectra with the aid of in silico tryptic digestion of target protein sequences. Besides X!Tandem, the more recent sequence database-assisted search engine MS-GF+ was included in the combination [12,13].
One significant advantage of this search engine lies in its insensitivity to the individual experimental set-up (low/high resolution, fragmentation mode), which improves its identification performance compared with that of other informatic tools designed for specific instrumental solutions [13].
The fourth approach selected was that of SpectraST, a search tool developed by Lam and colleagues that employs spectral searching of the experimental data against a library of experimental annotated MS/MS spectra [14]. According to the authors, this procedure vastly outperforms the identification capacity of the sequence search engine SEQUEST, both in terms of computational speed and of ability to discriminate good and bad hits [14,15].
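At the core of spectral library search is a similarity measure between the query spectrum and each library spectrum. A minimal sketch of the normalized dot product (cosine similarity) over binned intensity vectors follows; the binning, intensity weighting and scoring SpectraST actually applies are considerably more elaborate, and the m/z bins and intensities below are invented:

```python
from math import sqrt

# Sketch of the normalized dot-product similarity underlying spectral
# library search, on binned intensity vectors. The real SpectraST scoring
# (intensity weighting, bias correction) is more elaborate.

def dot_similarity(spec_a, spec_b):
    """spec_*: {mz_bin: intensity}. Returns cosine similarity in [0, 1]."""
    bins = set(spec_a) | set(spec_b)
    dot = sum(spec_a.get(b, 0.0) * spec_b.get(b, 0.0) for b in bins)
    norm = sqrt(sum(v * v for v in spec_a.values())) * \
           sqrt(sum(v * v for v in spec_b.values()))
    return dot / norm if norm else 0.0

query   = {175: 30.0, 262: 100.0, 375: 55.0}   # invented binned spectrum
library = {175: 28.0, 262: 95.0,  375: 60.0}   # invented library entry
print(round(dot_similarity(query, library), 3))  # -> 0.998
```

Because this comparison is a vector operation rather than a database-wide sequence search, library matching is fast, which is consistent with the speed advantage reported by the SpectraST authors.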
The combined identifications were used in the workflow for spectral feature definition, using the FFId algorithm reported by Weisser and colleagues [29], and for subsequent protein inference to determine protein groups, in parallel with PSM extraction using the PIA algorithm described by Uszkoreit and colleagues [30,31]. FFId was chosen over other spectral feature identifiers based on its higher capacity to produce quantifiable proteins and its higher speed compared with analogous tools in the OpenMS environment, such as FeatureFinderCentroided. Protein quantification was then achieved through the ProteinQuantifier node, with an approach similar to that described by Silva et al. [32].
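The quantification approach of Silva et al. can be sketched as a "top-N" index: a protein's abundance is estimated from the mean intensity of its N most intense peptides (N = 3 in the original work). The sketch below uses invented peptide intensities and is not the ProteinQuantifier implementation:

```python
# Hedged sketch of "top-N" label-free protein quantification in the spirit
# of Silva et al.: the abundance index of a protein is the mean intensity
# of its three most intense peptides. Intensities are invented.

def top3(peptide_intensities, n=3):
    ranked = sorted(peptide_intensities, reverse=True)
    return sum(ranked[:n]) / min(n, len(ranked))

protein_peptides = {
    "ALBU_MOUSE": [9.1e6, 7.4e6, 5.2e6, 1.1e6, 0.6e6],
    "CATA_MOUSE": [3.3e6, 2.8e6],   # fewer than 3 peptides: average what exists
}
lfq = {p: top3(i) for p, i in protein_peptides.items()}
print(lfq["ALBU_MOUSE"])  # mean of the three highest intensities
```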
In all considered cases, the computational descriptors of the LFQ (e.g., total numbers of identified peptides and proteins) were generally comparable or superior to those obtained using two common proteomic tools such as PD and MQ, or using X!Tandem and MS-GF+ in combination with Novor.
The obtained results agreed with previous findings on the liver proteome of mouse strains with different genetic backgrounds [33]. For this reason, and to better define the role of spectral searching in the performance of the DDASSQ approach, future work will focus on expanding the application of this tool to a wider set of raw data, with particular emphasis on different tissue and cell types, sample processing procedures, and data file dimensions.

CONCLUSIONS
Increasing the availability of consensus spectral databases, currently limited to a small number of species, will make the application of algorithms such as that used by SpectraST more feasible; this calls for further extensive work on spectra collection and compilation.

Data sets
Computations were run on a dataset representing the LC-MS analysis of tryptic digests of proteins extracted from the livers of mice fed a cholesterol-enriched diet [35], processed as described in the next paragraphs.

Animals
Wild type (WT), male mice on C57BL/6J background were purchased from Charles River (Italy) and The Jackson Laboratory (USA). Old mice

Sample preparation
Liver segments from WT mice (n = 2) were cleaned with sterile ice-cold

DDASSQ workflow
Prior to data analysis, each LC-MS raw file was converted from raw to mzML format in centroid mode using the MSconvert tool of the software ProteoWizard (version 3.0.1957) [39]. The mzML files were analyzed using a pipeline adapted from Weisser et al. [
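The conversion step can be expressed as an msconvert invocation; the sketch below only assembles the command line rather than running it, and the flag spellings (`--mzML`, `--filter "peakPicking …"`, `-o`), the file name and the output directory should be checked against the installed ProteoWizard version:

```python
import shlex

# Sketch (not executed here) of the raw-to-mzML centroid-mode conversion
# with ProteoWizard's msconvert. Flag names follow common msconvert usage;
# the input file name and output directory are invented examples.

def msconvert_cmd(raw_file, out_dir="mzml"):
    return [
        "msconvert", raw_file,
        "--mzML",                           # write mzML output
        "--filter", "peakPicking true 1-",  # centroiding on all MS levels
        "-o", out_dir,
    ]

cmd = msconvert_cmd("liver_F1_run1.raw")
print(shlex.join(cmd))
```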

Benchmark proteomic softwares
Proteomic data analysis was done using the software tools Proteome Discoverer (PD, version 2.2, Thermo Fisher Scientific, Waltham, MA, USA) and MaxQuant (MQ, version 1.6.7.0) [9]. The corresponding PD data processing workflow is described in Appendix S2. Both PD and MQ analyses were run using a precursor mass tolerance of 5 ppm and a fragment mass tolerance of 0.02 Da, with carbamidomethyl as fixed modification and methionine oxidation as variable modification, and with the same sequence database used for X!Tandem and MS-GF+ in the OpenMS workflow. Decoying was done in reverse-sequence mode. Trypsin was selected for in silico protein digestion, with a maximum of n = 2 missed cleavages, a peptide length for unspecific search between n = 8 and n = 25 amino acids, and the MQ LFQ and stabilize large LFQ ratios options on. MQ iBAQ was not activated.
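For clarity, the shared settings listed above can be collected in a single parameter mapping that is then translated into each tool's configuration; the key names below are illustrative only and are not actual PD or MaxQuant configuration keys:

```python
# The settings stated in the text, gathered as one mapping so the same
# values can be fed to each tool. Key names are illustrative, not actual
# PD or MaxQuant configuration keys.

SEARCH_PARAMS = {
    "precursor_mass_tolerance_ppm": 5,
    "fragment_mass_tolerance_da": 0.02,
    "fixed_modifications": ["Carbamidomethyl (C)"],
    "variable_modifications": ["Oxidation (M)"],
    "enzyme": "Trypsin",
    "max_missed_cleavages": 2,
    "peptide_length_range": (8, 25),   # for unspecific search
    "decoy_mode": "reverse",
}
print(SEARCH_PARAMS["max_missed_cleavages"])  # -> 2
```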

Datasets from PRIDE repository
The DDASSQ workflow identification and quantification performance was tested using datasets from two different studies. UPS µl added to 60 ng of S. cerevisiae background proteins [27]. Raw data are available for download at the URL https://cptac-data-portal.