Database search engines and target database features impinge upon the identification of post‐translationally cis‐spliced peptides in HLA class I immunopeptidomes

Abstract Unconventional epitopes presented by HLA class I complexes are emerging targets for T cell targeted immunotherapies. Their identification by mass spectrometry (MS) required development of novel methods to cope with the large number of theoretical candidates. Methods to identify post‐translationally spliced peptides led to a broad range of outcomes. We here investigated the impact of three common database search engines – that is, Mascot, Mascot+Percolator, and PEAKS DB – as final identification step, as well as the features of target database on the ability to correctly identify non‐spliced and cis‐spliced peptides. We used ground truth datasets measured by MS to benchmark methods’ performance and extended the analysis to HLA class I immunopeptidomes. PEAKS DB showed better precision and recall of cis‐spliced peptides and larger number of identified peptides in HLA class I immunopeptidomes than the other search engine strategies. The better performance of PEAKS DB appears to result from better discrimination between target and decoy hits and hence a more robust FDR estimation, and seems independent to peptide and spectrum features here investigated.

can strongly impinge upon the identified peptide pools. For instance, it has been shown that elution strategies affect peptide yields and create a bias in detected sequence repertoire (Nicastri et al., 2020). It is also generally accepted, and confirmed by various groups (Bichmann et al., 2019;Parker et al., 2021), that the number and features of identified peptides in canonical HLA-I immunopeptidomes strongly vary depending on the search engines that are used in the analysis of MS measurements.
Proteasome-generated cis-spliced epitopes were identified for the first time in 2004 (Hanada et al., 2004;Vigneron et al., 2004). They can target a CD8 + T cell response in vivo against bacterial antigens, which would be neglected by these T cells in the absence of cis-spliced epitopes (Platteel et al., 2017). They can also activate CD8 + T cells through cross-recognition of pathogen-derived non-spliced epitopes (Paes et al., 2019;Platteel et al., 2016). Cis-spliced epitopes derived from melanoma-associated antigens are recognised by CD8 + T cells in peripheral blood of melanoma patients (Ebstein et al., 2016;Faridi et al., 2020), and can be successfully targeted by adoptive T cell therapy in melanoma patients (Dalet et al., 2011;Robbins et al., 1994). Cisspliced epitopes could carry tumour-specific mutations  and seem to drive the immune response triggered by synthetic peptide vaccination in a mouse model of glioblastoma (Fidanza et al., 2021). Although trans-spliced peptides are produced in vitro by proteasomes (Berkers et al., 2015;Dalet et al., 2010;Liepe et al., 2010;Mishto et al., 2012;Specht et al., 2020) and detected in HLA-I immunopeptidomes (Faridi et al., 2018), their immunological relevance still need to be confirmed and thus they were not included in this study.
Despite this seminal evidence of the immunological relevance of spliced peptides in HLA-I immunopeptidomes, their frequency is still a controversial issue. After the publication of the first method for the identification of cis-spliced peptides in HLA-I immunopeptidomes , other groups published alternative methods and obtained contrasting results about their frequency (Admon, 2021;Faridi et al., 2021;Mishto, 2020;Purcell, 2021). Two recent studies reanalysed the list of cis-spliced peptides, which we previously identified in HLA-I spliced immunopeptidomes using Spliced Peptide Identifier (SPI) and SPI-delta methods , by using PEAKS X software. They identified many peptide-spectrum matches (PSMs) that, according to their analysis (Erhard et al., 2020;Lichti, 2021), have been wrongly assigned to cis-spliced peptides in the original papers.
Since both SPI and SPI-delta methods used Mascot as the database search engine , Erhard and colleagues (Erhard et al., 2020) hypothesised that this phenomenon could in part be due to difference in PSM assignments by Mascot and PEAKS DB. Such hypothesis was shown to be correct for HLA-I non-spliced immunopeptidomes (Bichmann et al., 2019).
To verify whether this hypothesis was correct also for HLA-I cisspliced immunopeptidomes, we here implemented two methods that, despite using a similar pipeline, reported a discordant range of cisspliced peptide frequencies in HLA-I immunopeptidomes, that is, Bassani-Sternberg's method (MBS) (Mylonas et al., 2018) and Purcell's method (AP) (Faridi et al., 2018(Faridi et al., , 2019, and applied them using either Mascot or PEAKS DB as final search engine. In the case of Mascot, we also applied the Percolator post-processing tool (i.e., Mascot+Percolator). Percolator is a semi-supervised machine-learning algorithm used to increase the number of peptides identified at a given FDR threshold and it is often used as post-processing step of Mascot .
We tested Mascot, Mascot+Percolator and PEAKS DB on ground truth HLA-I immunopeptidome datasets to compute search engine performance and to understand how performance is associated to the features of target databases, as well as on experimental HLA-I immunopeptidome datasets to measure non-spliced and cis-spliced peptide yield.

Cell lines
All cell lines were mycoplasma-negative and cultured in 5% CO 2 atmosphere at 37 • C. K562-B*07:02 and K562-A*02:01 cell clones express single HLA-I alleles. They derive from the leukaemia K562 cell line (ATCCCCL-243), which does not express endogenous HLA-I and -II molecules. The K562-A*02:01 clone was generated as described elsewhere (Eichmann et al., 2014). Briefly, the HLA-I allele was cloned and inserted into a pcDNA3.1 vector, transfected into K562 cell clone, which was then geneticin-selected and periodically single-cell sorted.
The K562-B*07:02 clone was generated by Lorenz et al. (2017)  F I G U R E 1 Proteasome-generated spliced peptides and analysis workflow. (A) Proteasome-generated spliced peptides can be formed by: i) cis PCPS, when the two splice-reactants, that is, the non-contiguous peptide fragments ligated by proteasomes, derive from the same polypeptide molecule; the ligation can occur in normal order, that is, following the orientation from N-to C-terminus of the parental protein (normal cis-PCPS), or in the reverse order (reverse cis-PCPS); ii) trans-PCPS, when the two splice-reactants originate from two distinct protein molecules or two distinct proteins. Extracted total RNA was sequenced and processed by GENEWIZ Inc.
After polyA enrichment, mRNA was fragmented, and cDNA was pro-duced using NEBNext Ultra RNA Library Preparation Kit with random priming. Sequencing was performed using HiSeq 2 × 150 PE HO with the depth of 20-25 million reads per sample. Reads were trimmed using Trim Galore with stringency parameter of 5. Quantification was performed using Salmon (v1.1.0) (Patro et al., 2017) with decoyaugmented GENCODE v33 human reference transcriptome (Frankish et al., 2019). Salmon selective alignment mode was shown to improve the transcript quantification accuracy and can be used together with sample-specific GC-content, position and sequence bias models (Srivastava et al., 2019). In order to further enhance the sensitivity, particularly in short transcripts (Wu et al., 2018) the k-mer size was reduced to 23 bp and 1,000 Gibbs samples were drawn from the posterior distribution of transcript abundances. To take advantage of Gibbs sampling and to correct for gene-length bias, the tximport R package (Soneson et al., 2015) was used to import transcript quantification results and to scale the resulting transcript per million values using median transcript length amongst gene isoforms, and then the library size (dtuScaledTPM). Only the transcripts that have received more than 10 estimated counts in at least one sample were considered to be expressed and their GENCODE protein-coding transcript translation sequences were selected for a common database for MS search.

Mass spectrometry
MS data generated for this project were collected using Orbitrap  spectrometer. In the analysis with Mascot+Percolator, we used Percolator 3.0.5. The feature set for Percolator was more limited for these datasets than for a standard tryptic digest due to the non-specific cleavage of proteasomes relative to trypsin and the fact that protein accession features were not applicable for de novo discoveries. The features used for Percolator were Mascot Ion Score, the percentage differences between the Ion Score of the PSM and the second and fifth highest ion scores for the same scan, the difference between the theoretical and experimental precursor mass and its absolute value, the length of the peptide, one hot encoding of the precursor charge, and one hot encoding of the post translational modifications which occurred in more than 10% of samples.

2.3.1
Method workflows for the identification of cis-spliced and non-spliced peptides ( Figure 1B The implementation of the methods here applied and deviations from their original version, which were necessary either for technical reasons or to provide a consistent and comparable workflow for all methods, are the following: AP method (Faridi et al., 2018): MS data was first searched against a reference human proteome database using PEAKS DB. Mass spectra not assigned as non-spliced peptides with 1% FDR were searched using PEAKS De novo. For the following analysis, the top 5 de novo candidate sequences per MS2 spectrum with an Average Local Confidence (ALC) score larger than a computed cut-off were exported. The ALC cut-off was determined based on the ALC distribution of nonspliced peptides, which were identified both via PEAKS DB at 1% FDR and PEAKS De novo. Among the top 5 de novo sequences within ALC scores above the cut-off, all sequences were aligned to all possible nonspliced peptides. If a potential non-spliced peptide was detected, all respective de novo candidates were discarded from further analysis.
Otherwise, if a potential cis-spliced peptide was detected, all remaining respective de novo sequences were discarded, and the cis-spliced peptide with the highest ALC score was extracted. If no potential cis-spliced peptide sequence was amongst the de novo sequences, the sequence was aligned to all possible trans-spliced peptides. Again, amongst all possible trans-spliced peptide sequences the one with the highest ALC score was extracted. If non, cis-and trans-spliced sequences could not be found, the MS2 spectrum was not further considered. Finally, per MS2 spectrum, a maximum of one spliced peptide candidate (either cis or trans) was extracted. All extracted spliced peptide candidates were concatenated into in silico proteins, which were appended to the reference proteome database, thereby generating a target database. This target database was then used to re-search the MS dataset using PEAKS DB. Identified peptides were extracted at 1% FDR as determined by PEAKS. To note, in our implementation, we used the PEAKS file 'all-de-novo-candidates.csv' , which Dr. Faridi confirmed being the actual file used in AP method (Faridi et al., 2018(Faridi et al., , 2020, rather than the 'de-novo-only peptides.csv' , which was the file reported in Faridi et al. (2018). (Mylonas et al., 2018): MS data were first searched against a reference human proteome database using PEAKS DB. MS2 spectra not assigned as non-spliced peptide with PEAKS score -log10P larger than 15 were re-searched using PEAKS De novo. The top five de novo candidate sequences per MS2 spectrum, were exported and aligned to all possible cis-spliced peptides with an intervening sequence not longer than 25 amino acids (rather than 20 amino acids as in the original study). A PEAKS local confidence score (LCS) of at least 80 was required for each amino acid in the peptide candidates. Remaining de novo candidates that could be explained by cis-peptide splicing were appended to the human reference proteome, thereby generating a target database. This target database was used to re-search the MS data. Resulting peptides were filtered for 1% FDR. In their original study, Mylonas et al. (2018) employed Andromeda in the MaxQuant framework as their final search engine. In our study, we were not able to robustly analyse MS data in the open mzML or mzXML formats with MaxQuant. Furthermore, for the generation of the constructed ground truth datasets, MS1 and MS2 spectra from various MS runs were collected and merged into a single mzML file, which interfered with the indexing of those files, making them uninterpretable by MaxQuant. Therefore, MaxQuant could not be used for benchmarking, and we decided to exchange the Andromeda search engine in the MaxQuant framework either with the stand-alone Mascot search engine or Mascot+Percolator (Mascot+Percolator). Both Andromeda and Mascot search engines operate on comparable algorithms, and analysis results were often found to be comparable in terms of precision and recall of non-spliced peptides (Bichmann et al., 2019;Paulo, 2013 If the two splice reactants match to two different proteins, the candidate sequence could be generated through trans-splicing. If more than one possible explanation is found, AP and MBS methods implemented a hierarchy, which prefers non-spliced over any spliced peptide and prefers cis-over trans-spliced peptides.

2.3.3
Generation of constructed HLA-I immunopeptidome datasets and reference database for benchmarking framework ( Figure 1C) To determine the performance of Mascot, Mascot+Percolator and PEAKS DB within MBS and AP method framework, we applied the methods to ground truth datasets using constructed databases.
The strategy for the generation of a ground truth dataset of MS1 and MS2 spectra, that resembled the characteristics of HLA-I bound peptides, followed the following steps: we used the MS RAW datasets For each peptide, we selected exactly one MS2. If more than one MS2 spectrum was assigned to the same peptide sequence, we selected one MS2 spectrum randomly and discarded the remaining MS2 spectra.
The selected MS2 spectra and their corresponding MS1 spectra were merged in the mzML format to a new constructed HLA-I immunopeptidome dataset. Resulting mzML files were tested for validity in both PEAKS DB and Mascot to check if all selected spectra were indeed correctly annotated (Table S1). The resulting three datasets represent the constructed ground truth HLA-I immunopeptidome datasets and were employed in the benchmarking framework using constructed reference databases. The latter were generated by modifying the reference human proteome database so that the target sequences included in the benchmarking dataset could be identified only as one of the three categories: non-spliced peptides, cis-spliced peptides with intervening sequence shorter than 26 amino acids, or trans-spliced peptides. Since only non-spliced and cis-spliced peptides could be identified by MBS method and we aimed to focus our analysis of the search engine performance in the identification of cis-spliced peptides, we only considered non-spliced and cis-spliced peptides in the respective algorithm's performance analysis. Thereby, in the constructed HLA-I immunopeptidome datasets, MS2 spectra derived from the pool of defined trans-spliced peptides represented potentially high-quality spectra, whose corresponding peptide sequence is not encoded in most database search strategies, and which could only be identified by exploring extremely large sequence search spaces.
We considered those MS2 spectra as 'trapping spectra' and the corresponding peptides as 'trapping peptides' . AP method may have an advantage when challenged with those trapping spectra, because it also considers the possible occurrence of trans-spliced peptides.
However, this potential advantage may come at the cost of overall less accurate peptide identification.
To produce a constructed reference database: (i) we modified the Uniprot standard proteome database by replacing all isoleucine (I) by leucine (L), that is, removing I/L redundancies that cannot be solved by MS; (ii) we split target sequences into the three peptide categories in equal proportion; (iii) in constructed reference database, we removed target sequences from the original reference database by replacing all substrings identical to a target peptide in the original reference database by a randomly sampled peptide sequence; (iv) we added non-spliced, cis-and trans-spliced target sequences to the constructed reference databases by randomly sampling proteins from the reference database and appending the target sequences to either C-or N-terminus of proteins. For cis-and trans-spliced peptides we randomly sampled a splice-site within each target sequence, split the target sequence into two splice-reactants and append the splice-reactants either to two different proteins (for trans-spliced peptides) or to the same protein after including a random intervening sequence between the two splice-reactants (for cis-spliced peptides).
After each iteration of appending, we tested that none of the other target peptide sequences could be assigned to a different peptide category.

MS2 spectra characteristics
For each MS2 spectrum in our ground truth datasets we computed MS2 spectra characteristics (relative ion coverage and signal-tonoise ratio) and corresponding peptide characteristics (peptide length, hydrophobicity index).

Statistical analysis
All statistical analysis has been implemented in R. All statistics for performance measurement are described in the benchmarking framework. FDR calculation is described in the respective methods sections.

Evaluation of Mascot, Mascot+Percolator and PEAKS DB performance in cis-spliced peptide identification in ground truth datasets
To evaluate the performance of Mascot, Mascot+Percolator and PEAKS DB as final search engines for identifying cis-spliced peptides in HLA-I immunopeptidomes, we implemented MBS and AP methods, using the three final search engine strategies. These two methods also generated target databases (FASTA files with canonical proteome and appended target spliced peptide sequences), which were used for the database search step (see also points 1 and 2 of Figure 1B). The target databases may have different features, which might impinge upon the final search engine performance.
In order to determine the methods' performance in combination with either Mascot, Mascot+Percolator or PEAKS DB in terms of precision and recall, we applied them to ground truth datasets using con-  (Table S1).
The constructed reference databases were generated by modifying the reference human proteome database so that we knew which non-spliced, cis-spliced and trapping (unidentifiable) peptides were present in these ground truth datasets (see also Materials and Methods section). Since all target peptides were, in reality, non-spliced peptide sequences derived from the canonical human proteins, we had to modify the reference database so that one-third of the peptides could be identified only as non-spliced peptides, one third only as cis-spliced peptides (with intervening sequences shorter than 26 residues) and one third of the peptides could not be identified. The latter aimed to mimic the portion of HLA-I immunopeptidomes that is neither non-spliced nor cis-spliced peptides, such as other potentially unknown peptide sequences not encoded directly in the human proteome database. In our benchmarking framework, we defined those unidentifiable peptides as trans-spliced peptides, since these peptides could not be identified by MBS method. In our analysis, unidentifiable peptides represented the large pool of unconventional peptides -for example, those derived from alternative open reading frames (ORFs), intronic or intergenic regions, single amino acid mutations as well as trans-spliced peptides (Erhard et al., 2020;Faridi et al., 2018;Laumont et al., 2016Laumont et al., , 2018Ruiz Cuevas et al., 2021) -and that cannot be identified with standard immunopeptidomics strategies. We chose this equal proportion of non-spliced, cis-spliced and unidentifiable peptides in the constructed reference databases bearing in mind the information gathered from the largest database of non-spliced, cis-spliced and transspliced peptide products identified via MS in in vitro digestions of synthetic polypeptides (Specht et al., 2020), and some studies detecting other unconventional peptides in HLA-I immunopeptidomes (Erhard et al., 2020;Faridi et al., 2018;Laumont et al., 2016Laumont et al., , 2018Ruiz Cuevas et al., 2021).
This strategy relying on constructed ground truth HLA-I immunopeptidome datasets and cognate reference databases allowed a robust benchmark since we knew which non-spliced, cis-spliced and unidentifiable peptides were present in these ground truth datasets; thereby, we could directly compute precision and recall (PR) curves ( Figure 1C).
We then applied AP and MBS methods -using either Mascot, Mascot+Percolator or PEAKS DB as final search engine -to each con- By contrast, the method performances were strongly reduced in the identification of cis-spliced peptides in all three constructed ground truth HLA-I immunopeptidome datasets, as compared to non-spliced peptide identification. The recall for the identification of these cisspliced peptides was limited, especially in the analysis carried out with MBS method. Also, overall, AP method showed a lower precision than MBS method. For both methods, precision of the identification of cisspliced peptides was strongly impaired by applying Mascot as the final search engine. Mascot performance in the identification of cisspliced peptides did not benefit from the addition of Percolator, with the exception of constructed ground truth HLA-A*02:01 immunopeptidome datasets using AP method framework (Figure 2A-C). This difference in precision between peptide identification strategies was reflected in the misassignment of MS2 spectra that corresponded to unidentifiable peptides in the constructed reference databases. MBS method showed a similar number of unidentifiable peptides' MS2 spectra wrongly assigned to non-spliced and cis-spliced peptide sequences, which was higher when we applied Mascot+Percolator as final search engine ( Figure 3A-C). In contrast, AP method wrongly assigned more unidentifiable peptides' MS2 spectra to cis-spliced than non-spliced peptide sequences ( Figure 3A-C), which mirrored the lower precision F I G U R E 2 Performance of the three database search engine strategies for the identification of non-spliced and cis-spliced peptides in constructed ground truth HLA-I immunopeptidome datasets. (A-C) Performance of the three database search engine strategies tested in the constructed ground truth HLA-I immunopeptidome datasets of K562-B*07:02 (A), K562-A*02:01 (B) and 721.221-A*02:01 (C). The original datasets were obtained through measurement by Orbitrap Fusion Lumos (A, B) or Q-Exactive Plus (C). PR curves for the identification of non-spliced and cis spliced peptides in constructed HLA-I immunopeptidomes. PR curves report precision -i.e., number correctly identified peptides over number identified peptides -on the Y axis and recall -i.e., number correctly identified peptides over number correct peptides -on the X axis, are reported. Curves represent the performances by applying a range of scoring cut-off. Number of true peptides present in each category is reported, which is a portion of the whole number peptides in the constructed HLA-I immunopeptidome datasets of K562-B*07:02 (n = 1556), K562-A*02:01 (n = 1668) and 721.221-A*02:01 (n = 1257) of AP method, regardless of the search engine applied, in the identification of cis-spliced peptides in the three constructed ground truth HLA-I immunopeptidome datasets (Figure 2A-C).

3.2
The features of the target databases rather than the PSMs impinge on search engine performance MS2 spectra characteristics -such as ion coverage and signal-tonoise ratio -as well as peptide characteristics -such as length and hydrophobicity -may impinge upon both precision and recall of iden- All cis-spliced peptide candidates are all cis-spliced peptide candidates present in a target database ( Figure 5A). As a first analysis, we computed the ratio of true cis-spliced peptide candidates over all cis-spliced peptide candidates included in each method's target database for all three constructed ground truth HLA-I immunopeptidome datasets.
The higher this ratio, the more informed is the target database and, therefore, the easier it is to reach high precision, that is, the easier it is to identify true cis-spliced peptide sequences ( Figure 5B). Furthermore, we computed the ratio of true cis-spliced peptide candidates included in each method's target database over all true cis-spliced peptides in a constructed ground truth HLA-I immunopeptidome dataset. The lower this ratio is, the more of the true target sequences are missing in a target database, which hinders their identification and, hence, results in low recall ( Figure 5C). Finally, we analyse the size of the spliced peptide target databases for both, AP and MBS method ( Figure 5D). The target database size is here defined as the number of De novo candidates that have been included as spliced peptide candidates in the final database search. According to our analysis, the number of true cis-spliced peptide candidates in a target database represented a sizeable portion of the cis-spliced peptide candidate in the same target database. MBS method, however, consistently showed a higher ratio of true cis-spliced peptide candidates over all cis-spliced peptide candidates in the target F I G U R E 4 Performance of the three database search engine strategies in association with features of peptides and their MS2 in constructed ground truth HLA-I immunopeptidome datasets. (A-D) The analysis was done by merging the results obtained by applying either Mascot, Mascot+Percolator or PEAKS DB within the framework of AP and MBS methods to the ground truth HLA-I immunopeptidome datasets derived from whole HLA-I immunopeptidomes of K562-B*07:02, K562-A*02:01 and 721.221-A*02:01 cell lines. To have a dataset large enough to investigate peptides and MS2 features, peptides of all three constructed ground truth HLA-I immunopeptidome datasets were analysed together. The analysed features are peptide hydrophobicity (A), peptide length (B), MS2 coverage (C) and MS2 signal-to-noise ratio (D). Peptides have been grouped based on their type -i.e., non-spliced (A) and cis-spliced peptides (B-D) -and whether they were assigned with the correct sequence, misassigned or not assigned at all. Violin plots describe the value distribution. Median is depicted with a longitudinal line databases ( Figure 5B). In addition, many true cis-spliced peptides present in the constructed ground truth HLA-I immunopeptidome datasets were overlooked by AP and, even more pronounced, by MBS methods ( Figure 5C). Furthermore, the target databases for AP methods consist of approximately six times as many spliced peptides compared to the target database of MBS method ( Figure 5D). These features of the target databases generated by both methods correlated well with the performance in the identification of cis-spliced peptides in the constructed ground truth HLA-I immunopeptidome datasets.
Indeed, on the one hand, the method that had the highest ratio of true cis-spliced peptide candidates over all peptide candidates in the target databases ( Figure 5B) -that is, MBS method -had also the highest precision of the identification of cis-spliced peptides in the constructed ground truth HLA-I immunopeptidome datasets (Figure 2A-C). On the other, the method that had the highest ratio of true peptide candidates in a target database over all true peptides present in the constructed ground truth HLA-I immunopeptidome datasets ( Figure 5C) -that is, AP method -had the highest recall of the identification of cis-spliced peptides in the constructed ground truth HLA-I immunopeptidome datasets (Figure 2A When applying PEAKS DB as final search engine the FDR-PSMs curves were flat until a certain scoring threshold was reached, after which the estimated FDRs increased strongly ( Figure 6). This allowed to determine a reliable scoring threshold for 1% FDR. On the contrary, when applying either Mascot or Mascot+Percolator as final search engine, the estimated FDRs increased steeply with increasing number of assigned PSMs, which hindered a reliable FDR estimation ( Figure 6).
Mascot+Percolator had a better FDR computation than Mascot alone. This is in line with previous studies investigating the impact of Percolator on Mascot performance in trypsin digestion samples. The FDR computation difference between Mascot and Mascot+Percolator in our HLA-I immunopeptidomes seemed less striking than reported by other in trypsin digestions (Kall et al., 2007).
After this preliminary analysis, we applied MBS and AP methods using the generated target databases, which included both non-spliced and cis-spliced peptides, and either Mascot, Mascot+Percolator or PEAKS DB as final search engine strategy. In agreement with that shown in Figure 6, when applying PEAKS DB as final search engine the FDR-PSMs curves were flat until a certain scoring threshold was reached, after which the estimated FDRs increased strongly (Figure 7), which allowed to determine a reliable scoring threshold for 1% FDR. By contrast, when applying either Mascot or Mascot+Percolator as final search engine strategies, the estimated FDR-PSM curves showed similar behaviours to what was observed for the standard non-spliced peptide identification pipeline (see Figure 6), which hindered a reliable 1% FDR estimation (Figure 7). Such a phenomenon could also indicate that F I G U R E 7 Association between identified non-spliced and cis-spliced PSMs and FDR estimation by applying different database search engine strategies to whole HLA-I immunopeptidome datasets. PSMs identified for a range of estimated FDRs computed by applying either Mascot, Mascot+Percolator or PEAKS DB as final search engine of AP and MBS methods to K562-B*07:02 and K562-A*02:01 HLA-I immunopeptidomes.
The identified PSMs accounted for both true and false assignments. The results have been obtained by applying the methods and search engines with target databases accounting for both non-spliced and cis-spliced peptides Mascot and Mascot+Percolator may be less able to distinguish true from false PSM assignments in these kinds of samples combined with larger, non-specific sequence search spaces.
For both MBS and AP methods, keeping a small estimated FDR, we have identified more PSMs by applying PEAKS DB as final search engine rather than either Mascot or Mascot+Percolator (Figure 7). Within the framework of MBS method, the number of cis-spliced peptides identified by applying the three final search engine strategies was similar. The frequency of cis-spliced peptides (with intervening sequence smaller than 26 amino acids) varied between 0.5% and 1.6% ( Figure 8A; Table S2-S4).

Identification of non-spliced and cis-spliced
By contrast, within the framework of AP method, which showed higher recall and lower precision of cis-spliced peptides in constructed ground truth HLA-I immunopeptidome datasets than MBS method (Figure 2), the number of cis-spliced peptides identified by applying PEAKS DB was larger than that identified by applying either Mascot or Mascot+Percolator as final search engines. The frequency of cis-spliced peptides (with intervening sequence smaller than 26 amino acids) varied between 4.6% to 15.0% ( Figure 8B; Table S2-S4). These frequencies of cis-spliced peptides identified by the different implementations of MBS and AP methods were consistent with that published by the cognate research groups (Faridi et al., 2019(Faridi et al., , 2018(Faridi et al., , 2020Mylonas et al., 2018).  Table S2-S4). A similar behaviour was observed for cis-spliced peptides within the framework of AP method ( Figure 8B). The low number of identified cis-spliced peptides by MBS method did not allow any conclusion with that method ( Figure 8A). This hypothesis was confirmed at PSM level. For example, around 70% of the PSMs assigned as cis-spliced peptides by AP method using Mascot-Percolator were assigned as such using PEAKS DB. In addition, most of the PSMs assigned as cis-spliced peptides by AP method using PEAKS DB were not assigned to any peptide sequence using Mascot-Percolator as final search engine. Similarly, most of the PSMs assigned as non-spliced peptides by AP method using Mascot-Percolator were assigned as such using PEAKS DB. In addition, half of the PSMs assigned as non-spliced peptides by AP method using PEAKS DB were not assigned to any peptide sequence using Mascot-Percolator as final search engine (Table S5).

DISCUSSION
The  (Figures 2 and 5). The different performance between these three final search engine strategies did not seem to depend on either peptide and MS2 features or mass spectrometer (Figures 2 and 4). Rather, the analysis of the association between number of assigned PSMs and FDR in whole HLA-I immunopeptidomes, hinted towards a more efficient FDR estimation algorithm of PEAKS DB compared to Mascot, which was only partially improved by adding Percolator to Mascot (Figures 6 and 7).
The outcome of this analysis might impinge upon the identification of other unconventional peptides in immunopeptidomics. Indeed, posttranslationally spliced peptides might be the vaster pool of unconventional peptides bound to HLA complexes but they are not the only one.
Other PTMs are frequent in immunopeptidomes as well as cryptic peptides derived from, for example, putative non-coding regions of the genome (Erhard et al., 2020;Laumont et al., 2018;Ruiz Cuevas et al., 2021). All unconventional peptide pools have the same inherent characteristic of enlarging the sequence search space compared to canonical non-spliced peptides. Inevitably, this required search engines to accurately distinguish true PSMs from false PSMs due to potentially very high sequence similarity between true and false hits. Additionally, the larger a target database was, the lower the ratio of true peptide sequences over all entries in a target database, and hence it would be harder to identify true PSMs.
PEAKS DB reduced this issue through its de novo assisted decoyfusion strategy. PEAKS DB prefiltered the user-provided reference database keeping only the top 8000 entries, which have a required number of de novo sequencing-based sequence-tags. This made not only the actual database search (PEAKS DB) efficient, but also reduced the final search space. Furthermore, PEAKS DB employed a decoyfusion strategy, whereby decoy sequences (inverted target sequences) were appended to each target entry in the database, thereby allowing for FDR estimation despite a two-round search.