Oktoberfest: Open‐source spectral library generation and rescoring pipeline based on Prosit

Machine learning (ML) and deep learning (DL) models for peptide property prediction such as Prosit have enabled the creation of high quality in silico reference libraries. These libraries are used in various applications, ranging from data‐independent acquisition (DIA) data analysis to data‐driven rescoring of search engine results. Here, we present Oktoberfest, an open source Python package of our spectral library generation and rescoring pipeline originally only available online via ProteomicsDB. Oktoberfest is largely search engine agnostic and provides access to online peptide property predictions, promoting the adoption of state‐of‐the‐art ML/DL models in proteomics analysis pipelines. We demonstrate its ability to reproduce and even improve our results from previously published rescoring analyses on two distinct use cases. Oktoberfest is freely available on GitHub (https://github.com/wilhelm‐lab/oktoberfest) and can easily be installed locally through the cross‐platform PyPI Python package.

These improvements are driven by rapid advances in MS [5] but also fuelled by the continuous development and improvement of bioinformatics workflows [6][7][8][9][10][11][12]. Most commonly used search engines for data acquired using data-dependent acquisition (DDA) calculate scores largely based on the absence and presence of peptide fragments. In this process, experimentally observed spectra are matched against in silico generated spectra with unit intensity, which allows the calculation of a variety of scores [13,14] that aim to separate correct from incorrect matches. Confident peptide-spectrum matches (PSMs) are then selected for downstream processing by filtering for a desired false discovery rate (FDR) using the target-decoy approach [15][16][17].
However, estimating the FDR solely based on the search engine score results in suboptimal separation since other matching features, such as mass accuracy of the proposed peptide and missed cleavages, are often not incorporated into the matching score. This is particularly apparent when applying more stringent FDR filters, often resulting in a loss of a vast number of PSMs. Machine learning enabled the use of multiple features by integrating them into a single score, often referred to as rescoring, by training a classifier that was tasked to separate correct from incorrect matches [18,19] based on the target and decoy score distributions. Tools such as Percolator [20] and mokapot [21] rely on search engines for generating a set of candidate PSMs for each spectrum without setting an FDR filter in advance. Subsequently, rescoring is applied, resulting in a more efficient separation of correct and incorrect matches, which increases the number of retained PSMs at a given FDR filter.
Nevertheless, the separation of correct and incorrect matches is not optimal and, depending on the application, a large number of (likely) correct PSMs is lost due to shortcomings of the heuristics applied to calculate, for example, the number of matching fragments. This is in part due to the lack of a ground truth spectrum to establish what a correct spectrum should look like. Again, machine learning provided a solution to this by training models that were tasked to accurately predict, for example, the expected fragment intensities in tandem mass spectra. Particularly over the last years, a number of high quality predictors for fragment intensities, such as Prosit [22], AlphaPept [23], and MS2PIP [24], and other peptide properties, such as retention time (RT) [25], were released. This reignited the development of data-driven rescoring workflows [26], in which additional scores (e.g., spectral similarity or ΔRT) are calculated and integrated into the rescoring workflow.
Among the available measures, the similarity between experimentally observed and predicted fragment intensities has gained particular attention as one of the most important metrics to estimate the quality of a PSM [22]. Data-driven rescoring increases the number of confidently identified PSMs by a substantial amount, which resulted in the development of multiple tools such as MS2Rescore [27], Inferys Rescoring [28], Inspire [29], and MSBooster [30]. We have developed Prosit, a neural network for peptide property prediction which is available via a web interface that allows users to perform the rescoring workflow on MaxQuant results or to generate in silico spectral libraries. Since its release in 2019, the web interface has generated an estimated 9 billion fragmentation spectra for external users. However, due to technical restrictions, the service is limited in the number of files and the supported file size of the uploaded data and thus limits usability to very small projects.
Here, we release Oktoberfest, an open-source, standalone, reusable, and interoperable Python package of the backend used by the Prosit web service. Oktoberfest allows rescoring of unlimited files in parallel, supports most search engine outputs, and has no restriction on the input required for spectral library generation. It retrieves predictions either from our existing Prosit-service or a locally hosted version. Oktoberfest also aims to provide a documented API that provides access to low-level functions that can be imported into other applications.
Oktoberfest-based rescoring has been shown to consistently increase proteomic coverage, showing substantial gains in the number of confidently identified peptides for immunopeptidomics data [31], metaproteomics studies [22], and full-proteome analysis for the development of a mouse draft proteome [32], and is applicable to TMT-labelled [33] and iTRAQ/TMTPro-labelled data [34] as well. Additionally, Oktoberfest-based rescoring was recently applied on data from the "Proteomes that Feed the World" project, which aims to map the proteomes of the 100 crop plants most important for human nutrition. For this dataset, data-driven rescoring increased peptide coverage by 40% and protein coverage by 25% across 18 tissues for a draft tomato proteome [35]. Spectral libraries generated by Oktoberfest can be used for the analysis of data acquired using DIA, showcased for non-model organisms and non-canonical databases [36], or enable full proteome spectral library searching workflows as recently showcased in Scribe [37].

A standalone, open source package
Oktoberfest is used as the backend of the online service for Prosit provided at ProteomicsDB (https://www.proteomicsdb.org/prosit/).
However, since the online service was limited to rescoring a single file, applying data-driven rescoring to large studies was not possible.
Hence, we have now fully transferred our previous code base to a public GitHub repository for Oktoberfest (https://github.com/wilhelm-lab/oktoberfest) as well as its two dependencies, spectrum-io (https://github.com/wilhelm-lab/spectrum_io) and spectrum-fundamentals (https://github.com/wilhelm-lab/spectrum_fundamentals), all available under the MIT license (Figure 1). We also provide documentation. All three packages are automatically tested against the newest Python versions on all three major platforms (Linux, Windows, and MacOS).
Oktoberfest is implemented as a wrapper around spectrum-io and spectrum-fundamentals (Figure 1), which provide specific standalone functionalities, enabling independent reuse and integration. The package spectrum-io is dedicated to handling tasks related to reading and writing files (input/output), including file conversion tasks, for example, converting the proprietary vendor format .RAW to .mzML files.
For this task, Oktoberfest uses the ThermoRawFileParser [47] library that runs natively under Windows and under MacOS and Linux using mono (https://www.mono-project.com/). The package spectrum-fundamentals implements core functionalities for spectrum annotation, metric calculation, fragment mass calculation, and processing of search engine results. For rescoring, Oktoberfest retrieves predictions from the online prediction service Koina (https://github.com/wilhelm-lab/koina) and offers three workflows for (1) spectral library generation, (2) calibrating normalized collision energy (NCE), and (3) data-driven rescoring of search engine results. Predictions can be retrieved from a freely accessible, public Koina instance, such as the one we provide at https://koina.proteomicsdb.org. This enables Oktoberfest to run on any standard notebook since hardware requirements can be kept low through online predictions using a selection of provided models through Koina. Alternatively, predictions can be retrieved from a self-hosted version of Koina, for which we provide a docker container (https://koina.proteomicsdb.org/docs), which may require a GPU server depending on the hosted models.

Data driven rescoring, library generation, and collision energy calibration
Oktoberfest supports three workflows. Starting with data-driven rescoring, this workflow requires MS and search engine result input files (Figure 1). While the online service of Prosit only allowed users to rescore MaxQuant results, Oktoberfest supports rescoring results from MSAmanda, Mascot, MSFragger, and Andromeda natively. Internally, all search engine files are converted to a minimal search engine result file (.prosit) that can be provided directly by the user, thus allowing the rescoring of any search engine results since input-file conversion can be done with minimal effort in spreadsheet editing tools, for example, Microsoft Excel. Similarly, MS data in .RAW format is converted automatically to the .mzML format [48], a widely accepted community standard that can be easily processed using existing parsers (e.g., Pyteomics and pymzML), but can also be provided in this format directly.
Once the input is read and converted, the search results are split by .mzML file, leaving only those search results for which an .mzML file is provided and omitting all others. This allows rescoring of subsets of data and subsequent parallel processing at the .mzML file level.
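The splitting step can be illustrated with a minimal sketch. It is not Oktoberfest's implementation: the record layout (the keys raw_file, scan, and sequence) and the function name split_by_spectra_file are hypothetical and chosen only for illustration. PSMs are grouped by the name of their MS file, and groups without a matching .mzML file are dropped.

```python
from collections import defaultdict
from pathlib import Path


def split_by_spectra_file(psms, spectra_paths):
    """Group PSM records by MS file name, keeping only files with spectra available."""
    # File names without extension, e.g. "data/run_A.mzML" -> "run_A"
    available = {Path(p).stem for p in spectra_paths}
    groups = defaultdict(list)
    for psm in psms:
        if psm["raw_file"] in available:
            groups[psm["raw_file"]].append(psm)
    return dict(groups)


# Hypothetical search results referring to two MS files
psms = [
    {"raw_file": "run_A", "scan": 101, "sequence": "PEPTIDEK"},
    {"raw_file": "run_B", "scan": 7, "sequence": "ELVISK"},
    {"raw_file": "run_A", "scan": 230, "sequence": "SAMPLER"},
]

# Only run_A.mzML is provided, so PSMs from run_B are omitted
per_file = split_by_spectra_file(psms, ["data/run_A.mzML"])
```

Each resulting group is independent of the others, which is what makes per-file parallel processing straightforward.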
After reading, the search results are merged with the corresponding MS/MS spectra, using the MS file name and scan number. Fragment peaks are annotated based on the PSMs provided by the search engine results. For annotation, the expected fragment peaks (b and y ions) in charge states 1-3 for every given PSM are calculated and matched to the experimentally determined data using a mass tolerance that can be provided by the user in either Dalton or ppm. If the mass tolerance is not provided, default values are used that are determined by the mass analyser read from the scan header, that is, 0.
Since the effective NCE used for fragmentation varies from machine to machine despite similar NCE settings [49] and to ensure maximal performance of Prosit [22], the optimal NCE for prediction needs to be determined, which is also available as a separate workflow. This is achieved by calibrating Prosit's NCE to the experimentally determined data by selecting the top 1000 highest scoring (native search engine score) PSMs and choosing the NCE which results in the highest mean normalized spectral contrast angle [50] (short: spectral angle, SA) between predicted and observed spectra. The NCEs tested span a range from 18 to 49 and calibration is performed individually per MS file. In practice, the most common NCE is around 30-34 NCE units. The NCE determined by this process is then used for later predictions.
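The calibration logic can be sketched as follows, assuming the common definition of the normalized spectral contrast angle, SA = 1 - 2·arccos(cosine similarity)/π, computed on intensity vectors aligned over the same fragment ions. The function names and the toy predictor are hypothetical: toy_predict merely stands in for a call to the prediction server, and the observed spectra are assumed to be those of the already selected top-scoring PSMs.

```python
import math


def spectral_angle(observed, predicted):
    """Normalized spectral contrast angle between two aligned intensity vectors."""
    dot = sum(o * p for o, p in zip(observed, predicted))
    norm_o = math.sqrt(sum(o * o for o in observed))
    norm_p = math.sqrt(sum(p * p for p in predicted))
    if norm_o == 0.0 or norm_p == 0.0:
        return 0.0
    # Clamp to [-1, 1] to guard against floating point overshoot
    cosine = max(-1.0, min(1.0, dot / (norm_o * norm_p)))
    return 1.0 - 2.0 * math.acos(cosine) / math.pi


def calibrate_nce(observed_spectra, predict, nce_values):
    """Return the NCE with the highest mean SA, plus the mean SA per tested NCE."""
    mean_sa = {}
    for nce in nce_values:
        angles = [
            spectral_angle(obs, predict(i, nce))
            for i, obs in enumerate(observed_spectra)
        ]
        mean_sa[nce] = sum(angles) / len(angles)
    best_nce = max(mean_sa, key=mean_sa.get)
    return best_nce, mean_sa


def toy_predict(index, nce):
    """Hypothetical predictor: intensities depend on the NCE only."""
    return [1.0, nce / 32.0, 32.0 / nce]


# "Observed" spectra are generated at NCE 32, so calibration should recover 32
observed = [toy_predict(i, 32) for i in range(5)]
best_nce, mean_sa = calibrate_nce(observed, toy_predict, range(18, 50))
```

Plotting mean_sa against the tested NCEs yields a calibration curve of the kind used as a quality control plot.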
Prosit predicts the expected indexed retention time (iRT) based on the PROCAL retention time kit [49] and, since most datasets do not contain this specific set of peptides, a calibration of the predicted iRT to the measured RT is performed to allow the calculation of the difference between predicted and observed RT. The calibration is performed on a set of PSMs that survive an initial internal 1% FDR estimate. For FDR estimation, a linear discriminant analysis (LDA) model is trained that discriminates correct from incorrect matches using a set of features based on the predicted spectra, such as the SA. The predicted iRTs for those target candidates are then aligned against experimentally observed RTs using locally weighted scatterplot smoothing (LOWESS).
The parametric solution allows inference of experimentally observed RT as a function of predicted iRTs. The resulting fit is used to calculate ΔRT (difference in predicted to observed RT), which has been shown to improve the separation of correct from incorrect matches [25,51].
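To make the alignment step concrete, the following sketch fits a simplified LOWESS (locally weighted linear regression with a tricube kernel and no robustness iterations) on calibration PSMs and uses it to map predicted iRTs onto the RT scale. It is an illustration under stated assumptions rather than the implementation used in Oktoberfest: the function name lowess_fit, the smoothing fraction, the example values, and the use of the absolute difference for ΔRT are all assumptions.

```python
import math


def lowess_fit(x, y, x_new, frac=0.5):
    """Evaluate a locally weighted linear fit of y on x at the points x_new."""
    n = len(x)
    k = min(n, max(2, math.ceil(frac * n)))  # neighbours per local fit
    fitted = []
    for x0 in x_new:
        dist = [abs(xi - x0) for xi in x]
        h = sorted(dist)[k - 1] or 1e-12  # bandwidth: k-th nearest distance
        w = [(1.0 - min(d / h, 1.0) ** 3) ** 3 for d in dist]  # tricube weights
        if sum(w) == 0.0:  # all neighbours tied at the bandwidth
            w = [1.0 if d <= h else 0.0 for d in dist]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0.0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


# Calibration set: predicted iRT and observed RT of confidently identified PSMs
irt_calib = [-20.0, 0.0, 20.0, 40.0, 60.0, 80.0, 100.0, 120.0]
rt_calib = [0.4 * irt + 15.0 for irt in irt_calib]

# PSMs to score: predicted iRT and observed RT (hypothetical values)
psm_irt = [10.0, 50.0]
psm_rt = [19.0, 37.5]

aligned_rt = lowess_fit(irt_calib, rt_calib, psm_irt)
delta_rt = [abs(obs - pred) for obs, pred in zip(psm_rt, aligned_rt)]
```

In this toy example the first PSM elutes exactly where predicted, while the second deviates by 2.5 RT units; a large ΔRT is evidence against a match.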
In addition to the SA and ΔRT, the number of peaks per spectrum that were predicted and observed, only predicted, only observed, or neither predicted nor observed, are counted per PSM and supplied as additional features to subsequent FDR estimation. This is done for y and b ions separately, as well as for both combined. Because the quality of an SA depends on the number of matching peaks, this is an additional quality measure that helps to separate correct from incorrect matches. In addition to this default set of features, an extended set, which was proposed to result in more identifications [27], can be used for data-driven rescoring. An in-depth description of the set of features and their interpretation is available in the supplement and can be found online in the Oktoberfest documentation (https://oktoberfest.readthedocs.io/en/stable/usage.html).
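The peak-count features can be sketched as below. The sketch assumes that observed and predicted intensities are already aligned over the same list of theoretical fragment ions and that an intensity of zero means "not observed" or "not predicted"; the function name, feature names, and example values are hypothetical and do not reflect the feature names used by Oktoberfest.

```python
def count_peak_overlap(ion_types, observed, predicted, ion_filter=None):
    """Count fragments by whether they were observed and/or predicted."""
    counts = {"both": 0, "predicted_only": 0, "observed_only": 0, "neither": 0}
    for ion, obs, pred in zip(ion_types, observed, predicted):
        if ion_filter is not None and ion != ion_filter:
            continue
        if obs > 0 and pred > 0:
            counts["both"] += 1
        elif pred > 0:
            counts["predicted_only"] += 1
        elif obs > 0:
            counts["observed_only"] += 1
        else:
            counts["neither"] += 1
    return counts


# One PSM with three b and three y fragments (hypothetical intensities)
ion_types = ["b", "b", "b", "y", "y", "y"]
observed = [0.0, 0.3, 0.0, 1.0, 0.0, 0.5]
predicted = [0.0, 0.2, 0.1, 0.9, 0.0, 0.0]

# Features for b ions, y ions, and both combined
features = {
    name: count_peak_overlap(ion_types, observed, predicted, ion_filter=flt)
    for name, flt in (("b", "b"), ("y", "y"), ("all", None))
}
```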
Oktoberfest then assembles the input files by combining the PSM-level results of all MS files for FDR estimation, which can be performed with either Percolator or mokapot. The FDR estimation is performed twice: (1) based on all data-driven rescoring features (referred to as, for example, Andromeda+Prosit+Percolator), and (2) based only on the native search engine score (referred to as, for example, Andromeda+Percolator). This is to enable a rough estimation of the gained, shared, and lost PSMs and peptide identifications by data-driven rescoring. However, the peptides confidently identified based only on the native search engine score may differ from the original search engine results filtered at 1% (referred to as, for example, MaxQuant 1% FDR). This is because not only may the FDR estimation procedure itself vary between the workflows, but also the set of features used for separation, which is often unknown unless the source code or documentation is available. In practice, we have observed that for rescoring Andromeda results, the 1% FDR estimation performed by Percolator results in slightly more identifications compared to running MaxQuant at 1% FDR. A detailed description of the generated output files is available online in the Oktoberfest documentation (https://oktoberfest.readthedocs.io/en/stable/usage.html).
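The comparison of the two FDR estimation runs amounts to simple set arithmetic on the identifications that pass the FDR threshold in each run. The sketch below illustrates this with hypothetical peptide sequences; the function name and the convention of reporting percentages relative to the baseline run are assumptions for illustration.

```python
def compare_identifications(baseline, rescored):
    """Summarize shared, gained, and lost identifications between two runs."""
    baseline, rescored = set(baseline), set(rescored)
    summary = {
        "shared": len(baseline & rescored),
        "gained": len(rescored - baseline),
        "lost": len(baseline - rescored),
    }
    # Relative changes are reported with respect to the baseline run
    summary["gained_pct"] = 100.0 * summary["gained"] / len(baseline)
    summary["lost_pct"] = 100.0 * summary["lost"] / len(baseline)
    return summary


# Peptides below 1% FDR using the search engine score only (hypothetical)
search_engine_only = {"PEPTIDEK", "ELVISK", "SAMPLER", "LIVESK"}
# Peptides below 1% FDR using all data-driven features (hypothetical)
data_driven = {"PEPTIDEK", "ELVISK", "SAMPLER", "ANEWPEPTIDER", "ANOTHERK", "RESCUEDK"}

summary = compare_identifications(search_engine_only, data_driven)
```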
When using Oktoberfest for spectral library generation, the third workflow, we strongly encourage users to first perform an NCE calibration on a recently acquired dataset or quality control run to ensure that the generated library contains spectra with optimal similarity. Library generation (Figure 1) requires either a peptides file in .csv format of the precursors for which spectra are requested, or a .fasta file which is then digested and prepared in silico by Oktoberfest. Two output formats are supported by Oktoberfest, Spectronaut-compatible .csv and .msp files, which can be used directly in combination with most spectral library search engines such as SpectraST [52] or Scribe [37], analysis pipelines such as Skyline [53], or DIA workflows such as Spectronaut [54] or DIA-NN [8].
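As an illustration of what in silico digestion of a .fasta entry involves, the sketch below digests a single protein sequence with the commonly used trypsin rule (cleavage after K or R unless followed by P). The protease rule, the parameter names, and their defaults are assumptions for this example; the digestion options actually offered by Oktoberfest are described in its documentation.

```python
import re


def digest_trypsin(protein, missed_cleavages=2, min_len=7, max_len=30):
    """Return peptides from a tryptic digest, allowing missed cleavages."""
    # Zero-width split after K or R, except when the next residue is P
    fragments = [f for f in re.split(r"(?<=[KR])(?!P)", protein) if f]
    peptides = []
    for start in range(len(fragments)):
        last = min(start + missed_cleavages, len(fragments) - 1)
        for end in range(start, last + 1):
            peptide = "".join(fragments[start:end + 1])
            if min_len <= len(peptide) <= max_len:
                peptides.append(peptide)
    return peptides


# Toy sequence: the R at position 8 is followed by P and is therefore not cleaved
peptides = digest_trypsin("AAAKGGGRPCCCKDDD", missed_cleavages=1, min_len=3, max_len=20)
```

Each resulting peptide would then be combined with the requested precursor charges and the calibrated NCE to form the list of precursors for which spectra are predicted.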

Example of a data-driven rescoring workflow
In order to exemplify the output generated by the data-driven rescoring pipeline of Oktoberfest, we have re-analysed a representative allele (C*12:03) from a recently published monoallelic HLA class I cell line dataset [3], which has been utilized to show the benefits of the rescoring approach [27,31]. For rescoring, we used the MaxQuant search results without FDR filtering [31].
Oktoberfest generates a number of quality control plots for each MS file used for rescoring. First, NCE calibration plots are generated that show the prediction performance of Prosit when using varying NCEs, including the NCE with the best mean SA over the tested range (Figure 2A). This plot serves as a quality control step since individual drops in MS performance or potential batch effects on biological or technical replicates during acquisition will be visible. For the data on allele C*12:03, the optimal NCE is 32, resulting in a mean SA of 0.86 for the top 1000 best scoring target PSMs. If the optimal NCE for prediction results in an SA of < 0.8, an incorrect model for prediction may have been used. Of note, the mean SA drops substantially when using NCEs that are not calibrated or set correctly, supporting the importance of NCE as a parameter for prediction models and the potential negative impact on the number of identifications [14] when generating/sharing spectral libraries with uncalibrated NCEs across datasets.
Second, RT alignment plots for each MS file are generated, highlighting the alignment of predicted iRTs to experimentally observed RTs, including a LOWESS fit for target PSMs/peptides that survive the LDA-based FDR cutoff (Figure 2B). These plots can be used to investigate potential differences in acquisition (e.g., exchange of analytical columns or deteriorating LC performance) or could highlight the use of an incorrect model for iRT prediction.
Oktoberfest further reports summary plots for the entire dataset that was rescored. First, PSM and peptide level plots are generated which compare the target-decoy (correct-incorrect) separation post Percolator when using the native search engine score (Andromeda+Percolator) or the Oktoberfest-derived features (Andromeda+Prosit+Percolator). The results of both Percolator runs are shown in a scatter plot with marginal histograms (Figure 2C). While Percolator was able to decently separate correct from incorrect matches using the Andromeda score only, a much better separation into two distinct distributions is visible for the Oktoberfest-derived feature set. This plot also serves as a quality control for FDR estimation, as one would expect near perfect overlap of the decoy and low-scoring target PSM/peptide distribution [55]. Furthermore, this plot also highlights the origin of the PSMs/peptides rescued by data-driven rescoring. These are typically PSMs/peptides that could not be differentiated from incorrect matches by classic database search engine scores exclusively, visible by the set of target PSMs/peptides that overlap with the decoys in Andromeda+Percolator but are well separated in the Andromeda+Prosit+Percolator dimension. Second, summary barplots are generated on PSM and peptide level, highlighting the gained, shared, and lost peptides between the two Percolator runs (Figure 2D). A total of 2574 peptides and 14,318 PSMs were uniquely identified by the Oktoberfest-derived feature set, which is equivalent to a gain of ∼489% on the peptide and ∼296% on the PSM level. At the same time, only 7 peptides and 49 PSMs were not confidently identified anymore when using the Oktoberfest-derived features for rescoring, which is equivalent to 1.3% and 1.0%, respectively, well within the expected number of false positive identifications surviving the 1% FDR cutoff.

Oktoberfest's extended feature set improves upon previous results
Oktoberfest is a reimplementation of the previously used rescoring pipeline, which also makes use of ΔRT as an additional feature. To evaluate the impact and compare the performance of Oktoberfest-based rescoring to the previously published results, we first compare the results of the Oktoberfest-based reanalysis of four alleles (A*11:01, A*31:01, C*12:03, G*01:03) to the previously published results [31].
As previously reported, MaxQuant identified 30,866 PSMs in total in the four analysed alleles (Figure 3A). When running Oktoberfest's rescoring pipeline on the unfiltered Andromeda results, a total of 38,196 and 95,451 PSMs were confidently identified for Andromeda+Percolator and Andromeda+Prosit+Percolator, respectively. This is equivalent to a gain of 10,892 (+24%) and 65,509 (+209%) PSMs and a loss of 3589 (−12%) and 924 (−3%) PSMs in comparison to the MaxQuant results for the two approaches, overall showing a similar increase in coverage as reported earlier. Investigating the differences between the previous results (78,842 PSMs) and Oktoberfest (95,451 PSMs) shows an increase in performance by 21% (Figure 3A). A total of 16,609 PSMs were rescued by Oktoberfest, equivalent to a gain of 9296 (+12%) PSMs and a reduction in lost PSMs of 7313 (89% reduction in losses) compared to previous results. These gains are attributed to the addition of the ΔRT feature and the enhanced feature set adopted from MS2Rescore.
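The bookkeeping behind the Andromeda+Prosit+Percolator numbers can be retraced with a few lines of arithmetic. The sketch below uses only values stated above; the reading that the reported +209% corresponds to the net change relative to MaxQuant is our interpretation, chosen because it reproduces the reported percentage.

```python
# Values as reported in the text
maxquant_psms = 30_866
gained, lost = 65_509, 924
previous_psms = 78_842

# Total after rescoring = baseline + gains - losses
rescored_psms = maxquant_psms + gained - lost

# Net change and losses relative to the MaxQuant baseline, in percent
net_change_pct = 100.0 * (rescored_psms - maxquant_psms) / maxquant_psms
lost_pct = 100.0 * lost / maxquant_psms

# Comparison with the previously published rescoring results
rescued = rescored_psms - previous_psms
increase_pct = 100.0 * rescued / previous_psms
additional_gain, fewer_losses = 9_296, 7_313
```

The 16,609 rescued PSMs thus decompose into 9296 additional gains and 7313 avoided losses.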
To further show that Oktoberfest is able to reproduce the previously obtained results, we reanalysed the metaproteomics dataset rescored previously [22]. Briefly, a metaproteomics sample was searched against increasingly large search spaces (SwissProt Human, SwissProt Human+Bacteria, SwissProt+IGC [56]) to highlight the substantial performance increase when utilizing data-driven rescoring in contrast to native search engine scores for exceptionally challenging search spaces. Both Andromeda+Percolator and Andromeda+Prosit+Percolator continuously increase the number of confidently identified peptides, while MaxQuant struggled with identifying and retaining confident peptide identifications for this challenging use case (Figure 3C). For the biggest search space (SwissProt+IGC), MaxQuant identified 434 PSMs only, in contrast to 114,510 PSMs when using data-driven rescoring. When comparing the smallest to the largest search space, a loss in identifications is expected, which is a cumulative effect of the bigger search space resulting in high score cutoffs to maintain a 1% FDR threshold and the replacement of identifications by better matching peptides from the increased search spaces. This is particularly visible for MaxQuant, where searching the data against a human only database initially resulted in the identification of 11,724 PSMs, but only 434 (3.7%) of those remained confident when searched against the SwissProt+IGC database (Figure 3D, left panel). For data-driven rescoring, a total of 19,124 PSMs were identified against a human only reference, of which 15,573 (81%) were retained for the biggest search space (Figure 3D, right panel). This is equivalent to a reduction in losses by 68% (11,290 and 3551 PSMs lost by MaxQuant and data-driven rescoring, respectively). This suggests that data-driven rescoring is much less susceptible to a drop in performance due to the increased search space and replacements of initially (human only) confident but incorrect peptide-spectrum matches, while at the same time boosting the number of confidently identified PSMs and peptides.

Performance analysis and system requirements
Analysing the compute requirements of Oktoberfest shows that runtime scales linearly with the number of PSMs (Figure 4). The majority of computation time is spent on MS file conversion, which requires ∼40% of the time. Most surprisingly, the time required to predict and retrieve fragment intensities and iRTs is comparable to iRT alignment and feature calculation owing to the usage of a high performance prediction server. Oktoberfest is able to perform parallel processing, splitting computation by MS file. This results in a ∼4-fold reduction in processing time (∼1 h single process vs. ∼15 min multi processing) for a total of 15 MS files. However, since FDR estimation needs to be performed on the combined results in order to maintain accurate global peptide and protein FDR estimates, a considerable amount of time is attributed to Percolator even when utilizing parallel processing.
In this case, Percolator contributed 5 min (∼33%) to the total runtime for the combined rescoring of the three alleles. On average, we observed a memory usage of ∼2 GB per processing job without parallel processing, that is, per MS file. As a result, we recommend, for average datasets, approximately 3 GB of memory per process, which allows the parallel processing of 3-4 MS files on a standard notebook with 16 GB memory.

CONCLUSION AND OUTLOOK
Oktoberfest was developed using modern continuous integration and deployment standards, enabling the easy and straightforward integration of new features (e.g., support for cross-linked peptides) and performance enhancements (e.g., for the fragment peak annotation).
Its flexible design allows the exchange of currently used functionalities by community developed alternatives, for example, psm_utils [57] as an alternative to spectrum_fundamentals, which we are planning to adopt in a future release. Furthermore, the integration of additional downstream processing tools, such as SIMSI-Transfer [58] and the picked-protein group FDR approach [59], will further extend the use cases and increase usability beyond identification-driven rescoring.
Predictions for rescoring and library generation are retrieved in a model-agnostic format utilizing a central or local prediction server.
This paves the path towards an ecosystem in which other existing (e.g., DeepLC and AlphaPeptDeep) and novel prediction models could be integrated to compare and evaluate their performance in well-established rescoring workflows. We anticipate that this will not only facilitate rapid model development and testing for developers, but will also enable users to benefit from the most recent ML/AI developments.
This approach may also circumvent hardware or scalability concerns of users as the prediction services are operated and maintained remotely, enabling users and tools to retrieve predictions anytime and anywhere without the need of costly, dedicated, and hard to maintain AI/ML software and hardware.

FIGURE 2
Output of Oktoberfest exemplified on a data-driven rescoring performed on an HLA measurement (allele C*12:03). (A) Normalized collision energy (NCE) calibration curve for the mass spectrometry (MS) file GG20170112_CRH_HLA_C1203_biorep1_techrep2 depicting the mean spectral angle (SA) when comparing experimentally observed spectra of the top 1000 highest scoring target peptide-spectrum matches (PSMs) to spectra predicted with Prosit at NCEs of 19-49 (in steps of 1). Optimal agreement (calibration) is reached at NCE 32 (vertical red line). (B) Alignment of predicted indexed retention time (iRT) (x-axis) to observed RT (y-axis) for all confidently identified (1% FDR) target PSMs (dots) from the MS file GG20170112_CRH_HLA_C1203_biorep1_techrep2. The color shades indicate regions of low (blue) and high (yellow) density. The solid black line indicates the fitted RT mapping using a locally weighted scatterplot smoothing (LOWESS) fit on the confidently identified (1% FDR) target PSMs determined using a linear discriminant analysis (LDA) model prior to final false discovery rate (FDR) estimation. (C) Comparison of target-decoy separation on peptide level using the Percolator scores after rescoring of Andromeda results only (x-axis, Andromeda+Percolator) and Oktoberfest-derived features (y-axis, Andromeda+Prosit+Percolator). Green and orange dots represent individual target and decoy peptides, respectively. The marginal distributions show the respective normalized density distributions of target and decoy identifications. (D) Overview of gains and losses when using Andromeda+Prosit+Percolator features for rescoring instead of Andromeda+Percolator features for all targets below 1% FDR on the peptide level (left bar) and PSM level (right bar). Target PSMs and peptides below 1% FDR that are retained only when using Andromeda+Prosit+Percolator features for rescoring are shown in green, while target PSMs/peptides lost are shown in red. The blue portions show the target PSMs/peptides that are identified by both.
To further investigate the differences between MaxQuant, Andromeda+Percolator, and Andromeda+Prosit+Percolator, we compared the three results on PSM and peptide level. The Venn diagram on PSM level (Figure 3B, left panel) highlights that, of the 10,892 PSMs added by Andromeda+Percolator in comparison to MaxQuant, 10,579 (97%) are also confidently identified when applying data-driven rescoring. Only 313 PSMs were uniquely identified by Andromeda+Percolator. Of the PSMs added by Andromeda+Prosit+Percolator, a large subset of 2859 PSMs were not confidently identified using Andromeda+Percolator and only 730 PSMs were uniquely identified by MaxQuant. A similar pattern is visible at peptide level (Figure 3B, right panel). While Andromeda+Percolator adds a total of 1080 peptides, Andromeda+Prosit+Percolator shares 1028 (95%) of these and further increases the gains by 1385 peptides that were not confidently identified by Andromeda+Percolator.

FIGURE 3
Evaluation of Oktoberfest rescoring versus previously published results. (A) Comparison of identified target peptide-spectrum matches (PSMs) of the four investigated human leukocyte antigen (HLA) alleles (A*11:01, A*31:01, C*12:03, G*01:03) using Andromeda+Percolator, Andromeda+Prosit+Percolator (original publication) and Andromeda+Prosit+Percolator against MaxQuant at 1% false discovery rate (FDR) (left bar). The blue bars represent PSMs shared with MaxQuant, the green and red bars the gains and losses, respectively. (B) Venn diagrams showing the overlap of identified target PSMs below a 1% FDR between Andromeda+Prosit+Percolator (light blue), Andromeda+Percolator (blue) and MaxQuant with 1% FDR cutoff (dark blue) on PSM and peptide level for the HLA alleles in (A). (C) Gains and losses of targets on the PSM level for the metaproteomics data for increasing database sizes (left to right): SwissProt Human (purple), SwissProt Human + Bacteria (yellow), and SwissProt + IGC (orange). Individual triplets of bars show the original MaxQuant results, Andromeda+Percolator and Andromeda+Prosit+Percolator at 1% FDR. Blue bars represent shared PSMs, green and red parts show gains and losses, respectively, compared to MaxQuant at 1% FDR. (D) Venn diagrams showing the overlap of identified target PSMs below a 1% FDR between the three databases (colors as in (C)) when using MaxQuant only (left Venn diagram) and Andromeda+Prosit+Percolator (right Venn diagram).

FIGURE 4
Runtime evaluation of data-driven rescoring in Oktoberfest. Stacked barplots show the individual (by processing steps) and combined (horizontal bar) runtime (y-axis) of six rescoring runs with varying number of peptide-spectrum matches (PSMs) (x-axis) performed with Oktoberfest on combinations of different alleles from the selected human leukocyte antigen (HLA) dataset. The x-axis shows the rescored allele followed by the number of PSMs. Horizontal black bars indicate the combined runtime with n (= number of .RAW files; see x-axis description of runs) parallel jobs performing the rescoring.
We presented Oktoberfest, a flexible and versatile open-source Python package for data-driven rescoring and in silico spectral library generation based on the deep learning model Prosit. Particularly for rescoring, Oktoberfest excels when analysing big search spaces (e.g., immunopeptidomics), challenging datasets with high dynamic range or low sample input (e.g., body fluids and single cells), full proteome analysis using non-tryptic proteases (e.g., chymotrypsin), and complex mixtures of proteomes (e.g., metaproteomics). We show the benefits of Oktoberfest-based rescoring for challenging datasets, which can be performed without the limitations of the web interface. Oktoberfest supports different fragmentation methods, mass analysers, and search engine inputs, consistently resulting in an increase of confidently identified peptides in comparison to classic approaches. Across all use cases tested so far, Oktoberfest increased peptide coverage by an average of 50% (up to hundreds of percent) and protein coverage by an average of 25%.

FIGURE 1
Overview of the data processing pipeline in Oktoberfest. Oktoberfest utilizes spectrum-io for input/output and file conversion operations and spectrum-fundamentals for spectrum annotation and similarity calculations. The main three use-cases of Oktoberfest are (1) the generation of spectral libraries (blue arrows), (2) NCE calibration, and (3) data-driven rescoring (both black arrows). For library generation, a list of peptides (.peptides) or reference proteome (.fasta) for digestion is required, which are predicted with Prosit and provided in various formats. For rescoring, mass spectrometry (MS) files (e.g., RAW), and search engine results (e.g., msms.txt) are required, which are converted to an internal representation (.prosit and .mzML). Oktoberfest performs spectrum annotation, normalized collision energy (NCE) calibration, and indexed retention time (iRT) alignment to generate an extensive set of features used for false discovery rate (FDR) estimation (via Percolator or mokapot).
More details on the features and output formats can be found online in the Oktoberfest documentation (https://oktoberfest.readthedocs.io/en/stable/usage.html).