Removing the Hidden Data Dependency of DIA with Predicted Spectral Libraries

Data‐independent acquisition (DIA) generates comprehensive yet complex mass spectrometric data, which imposes the use of data‐dependent acquisition (DDA) libraries for deep peptide‐centric detection. Here, it is shown that DIA can be redeemed from this dependency by combining predicted fragment intensities and retention times with narrow window DIA. This eliminates variation in library building and omits stochastic sampling, finally making the DIA workflow fully deterministic. Especially for clinical proteomics, this has the potential to facilitate inter‐laboratory comparison.


DOI: 10.1002/pmic.201900306
To date, the most common peptide-centric way to address this complexity is using previously identified peptides from DDA as targets in the DIA data. [1] First, DDA peptide identifications are translated into a spectral library with Peptide Query Parameters (PQPs), which typically contain the sequence as well as the analytical coordinates (m/z, intensity, and retention time or RT) for the observed ions of a given peptide. These PQPs are then used to compute an evidence score for each target peptide, based on its fragment traces in DIA. [2] Ultimately, these evidence scores are supplemented with additional features, e.g., ppm and RT errors, allowing a semi-supervised machine learning algorithm to weigh and re-score the target peptides to obtain a maximum of true targets at an empirically determined false discovery rate (FDR) using the target-decoy approach. [3][4][5] Unfortunately, deriving PQPs from DDA data intrinsically means transferring its limitations. In fact, fractionation, stochastic data acquisition, processing, and identification introduce bias in the library and require considerable effort. This compromises inter-laboratory comparison and can even alter the biological conclusions between laboratories. [6] However, thanks to the availability of state-of-the-art prediction algorithms, these PQPs can now be predicted directly, setting the stage for much easier and much more reproducible peptide-centric DIA data extraction. [7][8][9]
Here, we compare the effect of using libraries from different origins on peptide-centric approaches, by assessing their qualitative and quantitative performance on a public wide window (10-20 m/z) DIA dataset of HeLa cells [10] (Figure 1). Three basic spectral libraries were used here, with PQPs derived from a) an experimental DDA dataset, b) a protein sequence database (FASTA), and c) a predicted spectral dataset.
Each of these three libraries can be used directly as a source library, or can be converted into a DIA library by first applying it to a narrow window (2 m/z) DIA dataset of the sample. The resulting six possible libraries can all be used alike by the EncyclopeDIA software to identify and quantify wide window DIA data. [10] In the remainder of the text, we define (i) peptide detections as those reported by the software at 1% FDR, (ii) peptidoforms as peptide detections with deconvoluted charge states, and (iii) robust peptides as those detected in three separate runs with at least three transitions.
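The definition of robust peptides given above amounts to a simple filter over per-run detections. The following is a minimal Python sketch, assuming detections are available as (run, peptide, number of transitions) tuples; the function name and the input layout are hypothetical and do not reflect the actual output format of EncyclopeDIA.

```python
from collections import defaultdict


def robust_peptides(detections, min_runs=3, min_transitions=3):
    """Return the peptides that qualify as 'robust'.

    detections: iterable of (run_id, peptide, n_transitions) tuples
    (hypothetical layout, chosen for illustration only).
    A peptide is kept when it is detected with at least `min_transitions`
    transitions in at least `min_runs` distinct runs.
    """
    runs_per_peptide = defaultdict(set)
    for run_id, peptide, n_transitions in detections:
        # Only detections with enough transitions count toward robustness.
        if n_transitions >= min_transitions:
            runs_per_peptide[peptide].add(run_id)
    return {
        peptide
        for peptide, runs in runs_per_peptide.items()
        if len(runs) >= min_runs
    }


if __name__ == "__main__":
    example = [
        ("run1", "PEPTIDEA", 4), ("run2", "PEPTIDEA", 3), ("run3", "PEPTIDEA", 5),
        ("run1", "PEPTIDEB", 2), ("run2", "PEPTIDEB", 4), ("run3", "PEPTIDEB", 4),
    ]
    print(sorted(robust_peptides(example)))
```

In this toy input, PEPTIDEB is seen in all three runs but reaches three transitions in only two of them, so it is not counted as robust.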
In-house or public DDA source libraries are frequently built by extensive fractionation of samples. With adequate statistical control, such proteotypic libraries allow direct peptide detections in wide window DIA (Figure 1Aa). [11] We illustrate this by using the publicly available Pan-Human library, which contains nearly 10 000 proteins derived from 331 DDA runs on a range of human cell lines and tissues [12] (Figure 1Ba). To reduce the effort and variability from DDA library building, a library-free peptide-centric data analysis workflow was proposed recently. [13] Herein, the PECAN (or Walnut) scoring algorithm allows direct detection of peptides derived from a FASTA in wide window DIA data (Figure 1Ab). This is akin to a source library that i) contains only peptide sequences and m/z coordinates, and ii) lacks prior selection of proteotypic peptides. On wide window DIA data, this approach thus provides a limited number of PQPs, which is not sufficient to differentiate between the high number of false targets, i.e., true negatives, and the lower number of true positives in the library. [14] This manifests as indiscernible target and decoy score distributions, resulting in a very high false negative rate (FNR; Figure 1Bb).
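To make the statistical burden of false targets more concrete, the target-decoy estimate behind the FDR thresholds used here can be sketched as follows. This is a simplified illustration in Python, assuming each candidate has a single score and a decoy flag; it is not the re-scoring procedure of any particular tool, and the function names are hypothetical.

```python
def target_decoy_qvalues(scored):
    """Compute q-values with the basic target-decoy approach.

    scored: list of (score, is_decoy) tuples, higher score = better match.
    Returns a list of (score, is_decoy, q_value) sorted by descending score.
    """
    ranked = sorted(scored, key=lambda item: item[0], reverse=True)

    # FDR at each rank: decoys accepted so far / targets accepted so far.
    fdrs = []
    n_targets = n_decoys = 0
    for _, is_decoy in ranked:
        if is_decoy:
            n_decoys += 1
        else:
            n_targets += 1
        fdrs.append(n_decoys / max(n_targets, 1))

    # q-value: the lowest FDR at which this candidate would still be accepted.
    q_values = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        q_values.append(running_min)
    q_values.reverse()

    return [(s, d, q) for (s, d), q in zip(ranked, q_values)]


def count_targets_at_fdr(scored, threshold=0.01):
    """Number of target detections passing the given FDR threshold."""
    return sum(
        1
        for _, is_decoy, q in target_decoy_qvalues(scored)
        if not is_decoy and q <= threshold
    )
```

When target and decoy score distributions overlap heavily, decoys appear high in the ranking, the estimated FDR rises quickly, and few targets remain below the 1% threshold, which corresponds to the high false negative rate described above.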
Here, we propose a promising way to improve upon the FASTA source library, while still omitting prior DDA, by predicting fragment ion intensity and RT in silico (Figure 1Ac; Figures S1 and S2, Supporting Information). Using a spectral dataset with such predicted fragment intensities (MS²PIP) and peptide RTs (Elude) more than doubles the number of peptides detected in the wide window DIA (Figure 1Bc). [7,15] However, considering all tryptic peptides in a human proteome still underperforms compared to the Pan-Human DDA library, which is fully contained in the predicted spectral dataset (Figure 1Ba,Bc). Notably, this is not due to poor prediction, because predicting only those peptides present in the Pan-Human library performs very similarly to using the Pan-Human library directly (Figure S3, Supporting Information). The underperformance can thus only be attributed to the many false targets when using the complete database. [11]
An elegant way to filter out false target peptides upfront is to measure a pool from every condition with staggered narrow window DIA (Figure 1Ad-f). This reduces MS2 chimericity to DDA-like quality in a DIA setting, allowing detection with increased specificity. This accurate prior filtering makes the statistical burden of false targets in the wide window DIA surmountable again. Notably, due to instrument limitations this "Precursor Acquisition Independent From Ion Count" (PAcIFIC) [16] can currently only be performed by means of gas phase fractionation (GPF), i.e., sampling different m/z regions separately. [10] Still, the added acquisition depth and specificity allow for 88k (DDA), 47k (FASTA), and 95k (predicted) doubly and triply charged peptide detections as reported by the software, corresponding to 84k, 44k, and 90k peptidoforms in six narrow window GPF DIA runs of a HeLa cell lysate (Figure S4, Supporting Information).
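As a rough illustration of gas phase fractionation, the Python sketch below splits a precursor m/z range into equal fractions and tiles each fraction with narrow isolation windows. The 400-1000 m/z range in the usage example is an assumed value chosen for illustration (the text above states only that six GPF runs and 2 m/z windows were used), the function name is hypothetical, and the staggering of windows between acquisition cycles is not modeled.

```python
def gpf_windows(mz_start, mz_end, n_fractions, window_width):
    """Tile [mz_start, mz_end) with narrow windows, grouped per GPF run.

    Returns a list with one entry per gas phase fraction; each entry is a
    list of (lower_mz, upper_mz) isolation windows for that injection.
    """
    if n_fractions <= 0 or window_width <= 0 or mz_end <= mz_start:
        raise ValueError("invalid fractionation parameters")

    fraction_span = (mz_end - mz_start) / n_fractions
    windows_per_fraction = round(fraction_span / window_width)

    scheme = []
    for i in range(n_fractions):
        fraction_start = mz_start + i * fraction_span
        scheme.append([
            (fraction_start + j * window_width,
             fraction_start + (j + 1) * window_width)
            for j in range(windows_per_fraction)
        ])
    return scheme


if __name__ == "__main__":
    # Assumed example: 400-1000 m/z, six GPF runs, 2 m/z windows.
    scheme = gpf_windows(400, 1000, n_fractions=6, window_width=2)
    for run_index, windows in enumerate(scheme, start=1):
        print(run_index, windows[0], windows[-1], len(windows))
```

Each injection then covers only a slice of the precursor range, which is what allows windows this narrow to be acquired within a practical cycle time.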
Significance Statement
Data-independent acquisition (DIA) is quickly developing into the most comprehensive strategy to analyse a sample on a mass spectrometer. Correspondingly, a wave of data analysis strategies has followed suit, improving the yield from DIA experiments with each iteration. As a result, a worldwide wave of investments in DIA is already taking place in anticipation of clinical applications. Yet, there is considerable confusion about the most useful and efficient way to handle DIA data, given the plethora of possible approaches with little regard for compatibility and complementarity. In our study, we outline the currently available peptide-centric DIA data analysis strategies in a unified graphic called the DIAmond DIAgram. This leads us to an innovative and easily adoptable approach based on predicted spectral information. Most importantly, our contribution removes what is arguably the biggest bottleneck in the field: the current need for data-dependent acquisition (DDA) prior to DIA analysis. Fractionation, stochastic data acquisition, processing, and identification all introduce bias in the library. By generating libraries through data-independent, i.e., deterministic, acquisition, stochastic sampling in the DIA workflow is now fully omitted. This is a crucial step toward increased standardization. Additionally, our results demonstrate that a proteome-wide predicted spectral library can substitute for an exhaustive DDA Pan-Human library that was built based on 331 prior DDA runs.

To ensure that this additional filtering is accurate, we confirmed the estimated FDR by using an entrapment experiment wherein we included Pyrococcus furiosus proteins as false targets alongside the expected human proteins in the respective source libraries. [17] Here, the measured FDR for narrow window DIA filtering is 2% for the DDA, 1% for the FASTA, and 1% for the predicted source library, in accordance with the theoretically estimated FDR based on the target-decoy strategy. In the process, we can measure the identification cost of adding false targets: adding 3-6% false targets results in an average decrease of 1-2% in detections (see Entrapment Section in Supporting Information Methods).
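The idea behind such an entrapment check can be sketched in a few lines of Python. The exact formula used in the study is described in its Supporting Information and is not reproduced here; the two estimators below (a plain lower bound, and an estimate scaled by the entrapment share of the library) are assumptions chosen for illustration, as are the function name and the example numbers.

```python
def entrapment_fdr(n_entrapment_hits, n_total_hits, entrapment_fraction):
    """Estimate the false discovery proportion from entrapment detections.

    n_entrapment_hits: detections mapping to entrapment (known-absent) peptides
    n_total_hits: all detections passing the FDR threshold
    entrapment_fraction: share of library entries that are entrapment peptides

    Returns (lower_bound, scaled_estimate):
      lower_bound     = entrapment hits / total hits
      scaled_estimate = assumes false hits are spread evenly over all library
                        entries, so entrapment hits are scaled up by
                        1 / entrapment_fraction (capped at 1.0).
    """
    if n_total_hits <= 0:
        raise ValueError("n_total_hits must be positive")
    if not 0 < entrapment_fraction <= 1:
        raise ValueError("entrapment_fraction must be in (0, 1]")

    lower_bound = n_entrapment_hits / n_total_hits
    estimated_false_hits = n_entrapment_hits / entrapment_fraction
    scaled_estimate = min(1.0, estimated_false_hits / n_total_hits)
    return lower_bound, scaled_estimate


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    lower, scaled = entrapment_fdr(
        n_entrapment_hits=5, n_total_hits=10_000, entrapment_fraction=0.05
    )
    print(f"lower bound: {lower:.4%}, scaled estimate: {scaled:.2%}")
```

Comparing such an entrapment-based estimate with the target-decoy FDR reported by the software indicates whether the reported FDR is well calibrated.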
Additionally, the peptide detections in narrow window DIA can be translated into novel and integrated PQPs, which are calibrated to the specific LC-MS system and are specific to DIA (Figure 1A). This approach was recently made readily applicable as chromatogram libraries: DIA libraries of narrow window DIA peptide detections comprising their calibrated PQPs. [10] Such chromatogram libraries outperform direct wide window DIA extraction for every source library. The modest gain for a DDA source library (≈20%) derives mainly from PQP calibration, as only 50% of the source peptides were filtered out (Figure 1Ba,Bd). In contrast, in the FASTA source library, 98.5% of the peptides were filtered out, and RT and intensity coordinates were generated de novo. Taken together, this resulted in the largest gain (≈170%; Figure 1Bb,Be). Finally, the chromatogram library derived from a predicted spectral library increases the number of detections by ≈100% compared to direct wide window DIA data extraction, making it the most efficient overall peptide detection strategy of the DIAmond DIAgram (Figure 1Bc,Bf). Importantly, when looking only at robust peptide detections, i.e., with a minimum of three transitions and found in triplicate, the gain compared to the Pan-Human library is rather modest. Additionally, the robust peptides detected by all three chromatogram libraries show a large overlap, convincingly showing that the Pan-Human library is very exhaustive and that all three chromatogram libraries mainly detect proteotypic peptides (Figure 1C). Peptides unique to the Pan-Human library include peptides of very high molecular mass that were not predicted, high molecular weight peptides that generate many doubly charged transitions that are not predicted by default, as well as very small peptides with inherently poor RT or fragmentation pattern predictions. Peptides that are unique to the predicted library are all peptides that were not present in the Pan-Human source library and are of very low abundance in the wide window DIA data, implying they were missed during the DDA sampling in the Pan-Human library (Figure S4, Supporting Information). Note that some peptides will pass the detection threshold only in the narrow window DIA and not in the wide window DIA because of increased interference in the latter (1788 for the predicted and 673 for the Pan-Human).
Importantly, the PQP requirements of the source library for building chromatogram libraries on narrow window DIA are relatively liberal: the measured Pan-Human library was acquired on a TripleTOF instrument but allows wide window DIA data peptide detection on an Orbitrap instrument. The in silico equivalent is that 95% of the detected peptides overlap when the MS²PIP engine is trained on either Orbitrap or TripleTOF data. As a result, other fragment ion intensity predictors such as Prosit and DeepMass [8,9] perform similarly when combined with narrow window DIA [18] (Figures S5 and S6, Supporting Information). Overall, the peptide-centric workflow seems to have matured to a level where much of its most obvious potential for growth has been realized. Fortunately, very different ways of mining DIA data are continuously being presented, such as the use of neural networks or building ion networks. [19,20]
We conclude that predicted libraries are highly relevant and performant for wide window DIA identification, and that three elements of a spectral library affect its overall performance: i) the number of false targets included, ii) the number of informative PQPs, and iii) the accuracy of PQPs on the specific instrument setup. In this study, we could show that a narrow window DIA acquisition of six GPFs combined with a predicted spectral library of the full human proteome was able to substitute for a measured DDA Pan-Human library, thus liberating the DIA workflow from any stochastic acquisition. Especially for clinical proteomics, this can facilitate inter-laboratory comparison. Importantly, the software tools MS²PIP, ELUDE, and EncyclopeDIA are all instrument independent, publicly available, and mutually compatible, thus making this workflow immediately accessible to everybody interested.

Figure 1. A) Peptide-centric approaches rely on libraries (central column) that contain Peptide Query Parameters (PQPs), which are derived from the peptide sequence and can additionally contain the three ion coordinates, i.e., mass to charge ratio (m/z), intensity (Int), and retention time (RT) (three-part pie charts). These can either be experimental (blue), theoretical (grey), or predicted (red). PQPs are used to score the evidence of peptide detections in continuous DIA data (boxes). These are supplemented with additional features of the match so that a support vector machine can weigh and re-score them to obtain a maximum of true targets at an empirically determined FDR using the target-decoy approach (arrow heads). DDA source libraries (both in-house and public) only comprise prior proteotypic peptide identifications and contain measured PQPs for all three ion coordinates. These are therefore directly applicable to quantify peptides in 10-20 m/z wide window DIA (Wide DIA) data (a). However, when a proteome FASTA is used as a source library, sensitivity is reduced (dashed arrow), i.e., too many false negatives are produced due to the high statistical burden (b). This also holds for libraries with predicted fragment intensities (MS²PIP) and RT (Elude), albeit to a lesser extent (c). Prior 2 m/z narrow window DIA (Narrow DIA) provides the specificity to remove false targets in the sample first (d-f). The DIA ion coordinates from these detections can additionally be integrated into new and calibrated PQPs (cal). These DIA libraries, called chromatogram libraries, can be derived from any source library (triple arrow). B) Doubly and triply charged peptide detections in wide window DIA following each of the routes depicted in (A). Shading highlights the number of robust peptides that are detected in triplicate wide window DIA runs with at least three transitions, allowing robust quantification. C) Comparison of the identified robust peptides in Wide DIA for routes (d-f). The large overlap shows that all three approaches detect proteotypic peptides. Only peptides of double and triple charge that are detected in triplicate wide window DIA runs with at least three transitions are shown.

Supporting Information
Supporting Information is available from the Wiley Online Library or from the author.