A transformer architecture for retention time prediction in liquid chromatography mass spectrometry‐based proteomics

Accurate retention time (RT) prediction is important for spectral library-based analysis in data-independent acquisition mass spectrometry-based proteomics. The deep learning approach has demonstrated superior performance over traditional machine learning methods for this purpose. The transformer architecture is a recent development in deep learning that delivers state-of-the-art performance in many fields, such as natural language processing, computer vision, and biology. We assess the performance of the transformer architecture for RT prediction using datasets from five deep learning models: Prosit, DeepDIA, AutoRT, DeepPhospho, and AlphaPeptDeep. The experimental results on holdout datasets and independent datasets exhibit state-of-the-art performance of the transformer architecture. The software and evaluation datasets are publicly available for future development in the field.


INTRODUCTION
The retention time (RT) of a peptide refers to the time point at which it elutes from a liquid chromatography (LC) column in a proteomics experiment using a mass spectrometer (MS) coupled with LC separation. The peptide RT depends largely on the intrinsic properties of its amino acid composition. Accurate RT prediction has several applications. It is an integral part of a spectral library, a collection of mass spectra and RTs, for the analysis of data-independent acquisition mass spectrometry (DIA-MS) data [1-3]. RT prediction can also be used to improve peptide identification confidence to control the false discovery rate [4] and to identify chimeric fragmentation spectra [5]. Previous works have demonstrated the superiority of the deep learning approach over traditional machine learning methods for RT prediction [4,6-9]. In this paper, we examine the performance of the transformer architecture for this task. The transformer has recently been applied in proteomics [15], and Lou et al. [16] use a transformer block after an LSTM block for RT prediction. However, there is no comparison to previous studies on common datasets. Hence, it is highly desirable to assess the performance of the transformer architecture for our task.
We propose a transformer architecture for peptide sequences that incorporates proven improvements to the original model. Specifically, we implement an adapted transformer model based on the vision transformer [13]. Previous studies have shown that changing the order of the so-called layer normalization in the architecture leads to better training stability than the original version [17-19]. We will compare the performance of the proposed transformer model to state-of-the-art methods, including Prosit [6], DeepDIA [8], and AutoRT [4], for regular, nonphosphorylated peptides. In addition, we will compare the transformer model, with the same set of hyperparameters, to two recent methods for phosphorylated peptides called DeepPhospho [16] and AlphaPeptDeep [20]. Note that we consider only the RT prediction component of these methods, which also contain modules for MS/MS fragment intensity prediction or collisional cross-section prediction.
Since the performance of a learning model largely depends on training data, we will train and validate the method using the same datasets as existing methods for benchmarking. We will also create an independent dataset to evaluate the accuracy and other design choices, such as the constraint on the maximal peptide length. Finally, we will also compare the performance of the adapted transformer architecture with the original one.


MATERIALS AND METHODS

Amino acid embeddings
Modified residues are encoded by additional characters: "1" denotes oxidized methionine; "2", "3", and "4" denote phosphorylated serine, threonine, and tyrosine, respectively; and "*" denotes N-terminal acetylation. The input peptide sequence is converted to a vector of integers by replacing each amino acid of the sequence with a unique integer corresponding to its position in the string (i.e., "A" is 1, "C" is 2, "D" is 3, and so on). A class token CLS equal to the maximum value plus 1 (hence either 21, 22, or 26) is inserted at the front of the obtained vector. The vector is subsequently mapped to a vector space using an embedding layer, where the vocabulary size is given by the CLS value, the input length is the maximum sequence length, and the output dimension is d_model.
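As an illustration, a minimal sketch of this integer encoding (the helper names are ours, not those of the released software; the 20-letter vocabulary and CLS placement follow the description above):

```python
# Minimal sketch of the integer encoding of an unmodified peptide.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"                         # 20 standard residues
AA_TO_INT = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}  # "A"->1, "C"->2, ...
CLS = len(AA_TO_INT) + 1                                     # 21 for unmodified peptides
MAX_LEN = 50                                                 # maximum peptide length

def encode_peptide(sequence: str) -> list[int]:
    """Map a peptide to [CLS, aa_1, ..., aa_n, 0, ..., 0] (zero-padded)."""
    tokens = [CLS] + [AA_TO_INT[aa] for aa in sequence]
    return tokens + [0] * (MAX_LEN + 1 - len(tokens))

print(encode_peptide("ACDK"))  # [21, 1, 2, 3, 9, 0, ...]
```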

Positional embeddings
The positional embeddings are added to the amino acid embeddings to retain positional information of the amino acid sequence. The result of the summation is the input to the encoder layers.
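A minimal sketch of this input embedding, assuming Keras layers (d_model and the vocabulary size are illustrative values, not the tuned ones):

```python
import tensorflow as tf

# Sketch: token embeddings plus learned positional embeddings.
MAX_LEN = 50 + 1   # maximum peptide length plus the CLS token
VOCAB = 21 + 1     # tokens 1..21 plus the padding index 0
d_model = 256      # illustrative embedding dimension

token_emb = tf.keras.layers.Embedding(VOCAB, d_model)
pos_emb = tf.keras.layers.Embedding(MAX_LEN, d_model)

def embed(tokens):                              # tokens: (batch, MAX_LEN) int32
    positions = tf.range(tf.shape(tokens)[-1])  # 0 .. MAX_LEN-1
    return token_emb(tokens) + pos_emb(positions)  # broadcasts over the batch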

Encoder
A transformer consists of N encoder blocks. Each block consists of two sub-blocks: a multi-head attention layer with a padding mask, followed by a feed-forward network of size d_ff. Both sub-blocks have a residual connection [21] with a layer normalization [22] at the front and a dropout layer with a dropout rate r at the back. An architectural change from the original transformer (Figure S1) is that the normalization layer is placed prior to the multi-head attention layer and the feed-forward network layers, and inside the residual block. Attention is a mathematical method to retrieve the value associated with the key that best matches a given query.

Feed-forward network
The role of the feed-forward network is to process the output from the attention layer to better fit the input for the next attention layer.
The feed-forward network consists of two dense layers. The first layer has a Gaussian error linear unit (GELU) activation [23] with d_ff outputs, as in [12,24]. The second layer transforms the data back to a d_model-dimensional space.

Dropout
We apply dropout [25] to the output of each sub-layer. In addition, we apply dropout to the sums of the input embeddings.
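Putting the sub-layers together, a minimal sketch of one pre-norm encoder block in Keras (hyperparameter values and function names are illustrative, not those of the released software):

```python
import tensorflow as tf

def encoder_block(x, mask, num_heads=8, d_model=256, d_ff=1024, r=0.1):
    """One pre-norm encoder block: LayerNorm -> sub-layer -> Dropout -> add."""
    # Multi-head self-attention sub-block with a padding mask.
    h = tf.keras.layers.LayerNormalization(epsilon=1e-6)(x)
    h = tf.keras.layers.MultiHeadAttention(
        num_heads=num_heads, key_dim=d_model // num_heads)(
        h, h, attention_mask=mask)
    x = x + tf.keras.layers.Dropout(r)(h)

    # Feed-forward sub-block: expand to d_ff with GELU, project back to d_model.
    h = tf.keras.layers.LayerNormalization(epsilon=1e-6)(x)
    h = tf.keras.layers.Dense(d_ff, activation="gelu")(h)
    h = tf.keras.layers.Dense(d_model)(h)
    return x + tf.keras.layers.Dropout(r)(h)
```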

Prediction head
We use two dense layers with a dropout layer in between. The first dense layer has d_model nodes and a GELU activation; the second dense layer outputs the RT prediction. This is similar to the last layer of the Prosit model.
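A minimal sketch of this head, assuming the CLS representation is taken from the first sequence position (the sigmoid output corresponds to RTs normalized to [0, 1], as described in Results; values are illustrative):

```python
import tensorflow as tf

def prediction_head(encoder_output, d_model=256, r=0.1):
    cls = encoder_output[:, 0, :]                    # learned CLS representation
    h = tf.keras.layers.Dense(d_model, activation="gelu")(cls)
    h = tf.keras.layers.Dropout(r)(h)
    return tf.keras.layers.Dense(1, activation="sigmoid")(h)  # normalized RT
```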

Implementation and training
We use the TensorFlow library (https://www.tensorflow.org/) for the implementation. We minimize the median absolute error (MAE) as the training objective function. We use the Adam optimizer [26] with β1 = 0.9, β2 = 0.98, and ε = 1e−9, together with a custom learning rate schedule. The learning rate at step s is

lrate(s) = d_model^(−0.5) · min(s^(−0.5), s · w^(−1.5)),

with the number of warm-up steps w = 4000 as in [10].
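This schedule can be implemented as follows (a sketch mirroring the schedule of [10]; the d_model value is illustrative):

```python
import tensorflow as tf

class TransformerSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Warm-up schedule of [10]: lr(s) = d_model^-0.5 * min(s^-0.5, s * w^-1.5)."""
    def __init__(self, d_model, warmup_steps=4000):
        super().__init__()
        self.d_model = tf.cast(d_model, tf.float32)
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        s = tf.cast(step, tf.float32)
        return tf.math.rsqrt(self.d_model) * tf.minimum(
            tf.math.rsqrt(s), s * self.warmup_steps ** -1.5)

optimizer = tf.keras.optimizers.Adam(
    TransformerSchedule(d_model=256), beta_1=0.9, beta_2=0.98, epsilon=1e-9)
```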
We perform random search for hyperparameter tuning.
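For illustration, a random search loop could look like the following sketch (the search space and trial count are illustrative, not the paper's; train_and_validate is a hypothetical routine that trains a model with the given configuration and returns a validation error):

```python
import random

# Illustrative hyperparameter search space.
SPACE = {
    "num_blocks": [2, 4, 6, 8],
    "d_model": [128, 256, 512],
    "d_ff": [512, 1024, 2048],
    "num_heads": [4, 8],
    "dropout": [0.1, 0.2, 0.3],
}

def random_search(train_and_validate, trials=50):
    results = []
    for _ in range(trials):
        config = {name: random.choice(values) for name, values in SPACE.items()}
        results.append((train_and_validate(config), config))
    return min(results, key=lambda t: t[0])   # best (error, config) pair
```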

Evaluation measure
We use the MAE to evaluate the performance of RT prediction models.
An R script is used to calculate the evaluation metric from the text output of each model. The Wilcoxon signed-rank test is used to assess the statistical significance of the relative performance between models.
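For illustration, an equivalent computation in Python (the paper itself uses an R script; these helper functions are ours):

```python
import numpy as np
from scipy.stats import wilcoxon

def median_absolute_error(rt_true, rt_pred):
    """Median absolute error between measured and predicted RTs."""
    return np.median(np.abs(np.asarray(rt_true) - np.asarray(rt_pred)))

def compare_models(rt_true, pred_a, pred_b):
    """Paired Wilcoxon signed-rank test on the absolute errors of two models."""
    err_a = np.abs(np.asarray(rt_true) - np.asarray(pred_a))
    err_b = np.abs(np.asarray(rt_true) - np.asarray(pred_b))
    return wilcoxon(err_a, err_b).pvalue
```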

Prosit model and data
The Python code is downloaded from https://github.com/kusterlab/

DeepDIA model and data
The source code for DeepDIA is downloaded from https://github.com/lmsac/DeepDIA. Due to the lack of processed data, we started from the raw mass spectrometry dataset PXD005573 [1]. We performed a peptide search of the set of 17 LC-MS/MS runs used in the DeepDIA paper for the HeLa cells using MaxQuant version 1.6.14.0 with default settings [27]. The peptides in the MaxQuant msms.txt output are filtered for lengths from 7 to 50 amino acids. The median RT is used as the representative value for each peptide. The model takes a peptide sequence as input and converts it to a binary matrix, where each row represents an amino acid residue (20 rows in total) and each column stands for a residue position. To test an external dataset, we adapted the DeepDIA Python code to accept a new input file and output to a text file for further processing in R.
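A minimal sketch of this binary encoding (helper names are ours):

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
ROW = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(sequence: str, max_len: int = 50) -> np.ndarray:
    """Binary matrix with 20 residue rows and max_len position columns."""
    m = np.zeros((20, max_len), dtype=np.int8)
    for pos, aa in enumerate(sequence):
        m[ROW[aa], pos] = 1
    return m
```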

An independent test set
We selected from the ProteomeXchange dataset PXD019038 [29] an LC-MS/MS run with the highest number of peptide identifications.
Here the raw data were searched using MaxQuant 1.5.4.1 with default settings. The median RT is used as a representative value for each peptide. We filtered the peptides for length and oxidation modification appropriately for each prediction model.

Phosphoproteomics predictive models and data
We considered two models for phosphopeptide RT prediction, namely, DeepPhospho [16] and AlphaPeptDeep [20].

The transformer architecture is competitive to state-of-the-art architectures for RT prediction
The performance of a machine learning model depends on its training data. Hence, to properly assess a transformer model, we will use the same datasets that were used to train models currently applied in the field. Specifically, we will experiment with three deep learning models, Prosit, AutoRT, and DeepDIA, which employ different variations of the LSTM architecture. The Prosit model consists of a bidirectional recurrent neural network with gated recurrent units (GRUs) [35], a recurrent GRU layer, and an attention layer [36], all with dropout layers at the front and the back. DeepDIA uses a hybrid architecture with a convolutional neural network (CNN) [37] and a bidirectional LSTM network [11]. AutoRT uses an ensemble of deep learning models with CNNs combined with bidirectional GRU networks. Note that DeepLC [9] is a more recent method, but its performance is equivalent to that of previous ones on the unmodified peptides we consider here.
The Prosit and AutoRT datasets were downloaded from publicly available resources, while the DeepDIA dataset was obtained by reprocessing the raw data (Materials and methods). Table 1 summarizes the datasets. Using the training and validation sets, we developed a single transformer model for all three datasets. Figure 1 depicts the architecture, which is based on the BERT architecture with a class token (CLS) at the beginning whose learned representation is used in the last layer for prediction. The model differs from the original transformer in the placing of the normalization layer and the residual connection, and in the GELU activation (Figure S1). The RT values are normalized to [0,1] and a sigmoid activation with two fully connected layers is used as the prediction head. We implement a single Python source code for training, testing, and tuning in the TensorFlow environment. We perform three rounds of random search for hyperparameter optimization (Figure 2A).

Performance on an independent dataset
To assess the performance of the models on an independent dataset, we selected from a recently published dataset [29] an LC-MS/MS run with the highest number of peptides to avoid problems with cross-run normalization. For each model trained on Prosit data, DeepDIA data, and AutoRT data, we filtered the peptides appropriately for modification and maximum length. Next, we separated out peptides present in the training data of each model to obtain an accurate estimate of the generalization error. Here the training data consist of both the training set and the validation set, but not the holdout set. The resulting test sets are summarized in Table 2, and the predicted RTs are aligned to the measured RTs by least-squares estimation. Figure 3 shows the performance of the three transformer models and three existing models. The transformer outperforms all three existing deep learning models in terms of the median error, the interquartile range, and the error range (Wilcoxon signed-rank test p-values < 1e−5). The error of models based on the DeepDIA dataset is quite high, indicating that the number of training samples is insufficient (Table 1). Finally, the model trained on the Prosit dataset is the best of the three transformer models considered.
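For illustration, a minimal sketch of such a least-squares alignment, assuming a simple linear fit (the exact calibration procedure may differ):

```python
import numpy as np

def align(rt_pred, rt_measured):
    """Map predicted RTs onto the measured RT scale by linear least squares."""
    slope, intercept = np.polyfit(rt_pred, rt_measured, deg=1)
    return slope * np.asarray(rt_pred) + intercept
```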

Assessment of the adapted transformer architecture
We have adopted a recent variant of the transformer architecture which has several changes with respect to the original version. To assess the relative performance of the two variants for RT prediction, we trained a model with the original architecture. Briefly, instead of positional embeddings, we used the same sine and cosine functions of different frequencies for the positional encodings [10]. The layer normalization is performed after the residual connection (Figure S1). Finally, the ReLU activation function is used in place of the GELU activation.
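For reference, a sketch of these fixed sinusoidal positional encodings as defined in [10]:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Fixed sine/cosine encodings: sin on even dimensions, cos on odd ones."""
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(d_model)[None, :]                    # (1, d_model)
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe
```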

Assessment of the transformer architecture for phosphopeptides
Recent methods support RT prediction for phosphoproteomics data. DeepPhospho uses an LSTM block in front of transformer layers. We used four datasets from the development of DeepPhospho for evaluation (Materials and methods). The largest dataset [30] is converted to the DeepPhospho format. Briefly, each peptide starts with either a character "*" for N-terminal acetylation or a character "@" for no N-terminal acetylation. Methionine oxidation is coded by the character "1", and phosphorylation on serine, threonine, and tyrosine is coded by "2", "3", and "4", respectively. We also evaluated the models on three independent datasets (Table 3; note that these datasets were used in the development of DeepPhospho).
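As an illustration, a minimal sketch of this conversion (the input notation for modified residues, e.g. "S(ph)", is our assumption, not the DeepPhospho API):

```python
# Codes for modified residues in the DeepPhospho-style format described above.
MOD_CODE = {"M(ox)": "1", "S(ph)": "2", "T(ph)": "3", "Y(ph)": "4"}

def to_deepphospho(residues, n_term_acetyl=False):
    """residues: a list such as ["A", "S(ph)", "M(ox)", "K"]."""
    prefix = "*" if n_term_acetyl else "@"
    return prefix + "".join(MOD_CODE.get(r, r) for r in residues)

print(to_deepphospho(["A", "S(ph)", "M(ox)", "K"]))  # "@A21K"
```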

TABLE 3 Datasets for independent evaluation of the deep learning models for phosphoproteomics.

Finally, we visualized the amino acid embeddings of the transformer models using principal component analysis (Figure S7). We observed that the modified amino acids are considerably different from the unmodified ones, both for the nonphosphorylated datasets (oxidation of methionine) and for the phosphoproteomics datasets (all modifications covered).
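For illustration, a sketch of such a projection (E is the trained embedding matrix, e.g. extracted with get_weights() from the embedding layer; shape (vocab_size, d_model)):

```python
import numpy as np

def pca_2d(E):
    """Project embedding rows onto the first two principal components."""
    E = E - E.mean(axis=0)                      # center the embeddings
    _, _, vt = np.linalg.svd(E, full_matrices=False)
    return E @ vt[:2].T                         # (vocab_size, 2) coordinates
```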

DISCUSSION
We have implemented a transformer architecture that outperforms currently used state-of-the-art methods for RT prediction. The experimental results show that the training data are important; in our study, the Prosit dataset delivered the best results for nonphosphorylated peptides. In addition, the transformer model, without any additional hyperparameter tuning, delivers the best predictive results on the phosphoproteomics datasets in comparison to DeepPhospho and AlphaPeptDeep.
The variant of the transformer we implemented is better than the original transformer model, thereby confirming previous observations in the literature for other types of data [17]. We expect that the change in the order of the layer normalization and multi-head attention contributes most to the improvement. There are theoretical explanations for why pre-normalization is better [17-19]. In the original transformer work by Vaswani et al. [10], the authors experimented with positional encoding and positional embedding and concluded that there was no change in performance. Here, we have made some design choices based on our preference for the existing BERT and vision transformer architectures. We leave further analysis and fine-tuning of the transformer architecture for future research.
Unlike other deep learning models with elaborate architectures, our model for RT prediction is highly compact. We only add an embedding layer for the input and an output layer for the prediction. Hence, existing software for the transformer model can be readily employed. A high-performance transformer for proteomics is attractive because the architecture is highly successful in various other application fields and is therefore widely supported by the community. The transformer architecture already offers software robustness, scalability, and ease of deployment. As the accuracy of deep learning models for RT prediction of unmodified peptides reaches a saturation point, as illustrated here by the overlapping boxplots in Figure 3 and in recent publications [4,9], other factors such as usability will become important to facilitate a wider user base. This is particularly important for users with limited experience with machine learning tools.
Note that four out of five methods in the literature we considered have modules for MS/MS ion intensity prediction and are integrated in end-to-end systems for DIA-MS data analysis. In our opinion, MS/MS intensity prediction can be treated separately because of the different underlying physical properties. Our results and datasets will enable either fine-tuning these end-to-end systems or replacing their RT predictive component.
A common benchmarking environment is important to further develop prediction tools for proteomics data analysis [40]. Our effort in deploying the five existing deep learning models, as well as in preparing the data, was considerable. The availability of our software and datasets is a contribution in this regard.