In silico prediction of Severe Acute Respiratory Syndrome Coronavirus 2 main protease cleavage sites

Abstract One of the emerging subjects to combat the SARS‐CoV‐2 virus is to design accurate and efficient drug such as inhibitors against the viral protease to stop the viral spread. In addition to laboratory investigation of the viral protease, which is fundamental, the in silico research of viral protease such as the protease cleavage site prediction is critically important and urgent. However, this problem has yet to be addressed. This article has, for the first time, investigated this problem using the pattern recognition approaches. The article has shown that the pattern recognition approaches incorporating a specially tailored kernel function for dealing with amino acids has the outstanding performance in the accuracy of cleavage site prediction and the discovery of the prototype cleavage peptides.


| INTRODUCTION
SARS-CoV-2 is a single-stranded RNA genome and belongs to the coronavirus family and is composed of 23 ORFs. 1 Among them, ORF1a and ORF1b are translated to two polyproteins, 2 which can be cleaved by the viral proteases to generate 16 nonstructural proteolytic proteins. 3 The cleavage in these ORFs is mainly carried out by the chymotrypsin like 3CL cysteine protease (main protease). The papain-like protease (PLpro) carries out three cleavages. 2 A protease works when it interacts with a specific site in the amino acid sequence of a polyprotein. The site at which a polyprotein is cleaved by a protease is called a protease cleavage site. A collection of the consecutive residues surrounding a protease cleavage site is called a cleaved peptide expressed as P m ÁÁÁP 2 P 1 # P 0 1 P 0 2 ÁÁÁP 0 n . In this expression, # stands for the cleavage, P 1 ≤ i ≤ m stands for a N-terminal residue and P 0 1 ≤ j ≤ n stands for a C-terminal residue. The coronavirus main protease cleavage always happens at the amino acid glutamine in a polyprotein, that is, P 1 ¼ Q. However, the cleavage of a protease on subsites allows certain variation of the amino acid distribution, that is, the tolerances. 4,5 Only when the fitness between a protease and a substrate is satisfied, the protease will bind to a polyprotein at the substrate to cleave the polyprotein. This is called the lock-and-key mechanism. 6,7 The glutamines whose substrates do not fit the structure requirement for a protease will not be targeted by the protease for the cleavage.
Laboratory identification of the protease cleavage sites within a polyprotein is the best technology, but it is more expensive and time-consuming. 8,9 It is better to use an efficient in silico approach to screen out a subset of the glutamine residues which are the most probable protease cleavage sites. The best in silico approach is to use a pattern recognition approach model. The main data used for the in silico protease cleavage site prediction are peptides as aforementioned. In this study, a peptide of a main protease cleaved glutamine residue is called a cleaved main protease peptide, or a cleaved peptide for short in the rest discussion. Such a glutamine residue, which interacts with the main protease, is called a main protease cleavage site, or a cleavage site for short in the rest discussion. Other glutamine residues are called the uncleavable sites and the peptides at these sites are called the uncleavable peptides.
Various pattern recognition approaches have been employed to construct in silico predictive models for the protease cleavage pattern discovery based on the available known cleaved peptides for different viral proteases, but yet SARS-CoV-2 viral protease so far. For instance, a logistic linear regression model and a linear discriminant analysis model were used to predict the HIV-1 protease cleavage sites, 10,11 a decision tree model was constructed for the tryptic cleavage site prediction, 12 a support vector machine model was constructed for the caspase cleavage site prediction, 13 a multi-layer perception (or artificial neural network) model was constructed for predicting various protease cleavage sites 14 and a random forest model was constructed for predicting various protease cleavage sites as well, 15 to name a few.
The collection of the cleaved peptides is straightforward. All the experimentally verified cleavage sites for a protease can be used to generate the cleaved peptides. To collect uncleavable peptides, a specific rule must be followed, that is, they should contain sufficient background information for them to be compared with the cleaved peptides. 16 For instance, the coronavirus main protease only cleaves at a sequence where the P 1 residue is Q. 2,17,18 Therefore, a uncleavable peptide for the coronavirus main protease cleavage site prediction must target the uncleavable glutamine residues only.
Having known that the prediction of the SARS-CoV-2 main protease cleavage sites requires the glutamine peptides as inputs, the next issue is how to present glutamine peptides to a pattern recognition model. Most pattern recognition approaches only accept numerical data as the inputs. Therefore, the amino acids of the peptides must be transformed to some numerical data at first. This is called an amino acid encoding process. There have been many different methods for transforming the amino acids to numerical values. The mostly wellemployed methods in the literature include the binary encoding approach, 19 the descriptor encoding approach [20][21][22] and the profile encoding approach, 6,23 to name a few.
In addition to these approaches used to transform amino acids to numerical data, a question is whether there is an alternative to handle the non-numeric amino acids in a pattern recognition model, which can be more biologically sound. It has been found that the structure of a protease will not be varying very fast during an evolution 24 (Yen et al.,25 ). Most importantly, a protease will have some degree of the tolerance to target a cleavage site in a polyprotein for the interaction even if genetic evolution may have occurred. 26,27 Therefore, the correlation between the cleaved peptides should be statistically significant compared with the correlation between uncleavable peptides or the correlation between uncleavable and cleaved peptides. Based on this understanding, the kernel function 28,29 can be used to map the original non-numerical peptide space to a numerical kernel space based on the correlations between the available peptides and the cleaved peptides. This has led to the development of the bio-kernel function as an alternative approach to deal with non-numerical amino acids. The Supporting Information Document S2 provides the details of the bio-kernel function.
As aforementioned, in silico protease cleavage site prediction is a pattern recognition problem. Though the SARS-CoV-2 main protease has been researched in the laboratory, 30,31 the in silico prediction of the cleavage sites of this protease has yet to be addressed. This study, for the first time, examines the in silico prediction of the SARS-CoV-2 main protease cleavage sites using the pattern recognition approaches, especially incorporating the kernel function. This article will show two important findings. First, the SARS- CoV-2 main protease cleavage sites are predictable with high accuracy by an in silico   model because the cleaved glutamine peptides have reserved an   excellent cleavage pattern for separating the cleaved peptides from   the uncleavable peptides. Second, the pattern recognition approaches incorporating with the kernel function works the best.

| MATERIALS AND METHODS
In total, all available 64 SARS-CoV-2 protein sequences which contain the main protease cleavage sites were downloaded from NCBI using  Table S1 lists these 64 sequences. There were 273 main protease cleavage sites within part of these 64 sequences.
A cleaved peptide was generated for each cleavage site, which is expressed by P 5 P 4 P 3 P 2 P 0 1 P 0 2 P 0 3 P 0 4 P 0 5 . In this notation, the residue P 1 was removed because the coronavirus main protease always targets the amino acid glutamine (Q). 2,17,18 After removing the duplicated peptides, 116 non-redundant cleaved peptides were maintained for the study.
Correspondingly, non-redundant uncleavable peptides were also randomly selected from these 64 sequences. The following rule was used to select uncleavable peptides. Suppose one sequence had K cleavage sites and M > K non-cleavage glutamine residues for the main protease. K of M non-cleavage glutamines were randomly selected to generate K uncleavable peptides. This generated 273 uncleavable peptides. The duplicated 9-mer uncleavable peptides were also removed resulting in 259 non-redundant uncleavable peptides. Therefore the data set was composed of 375 9-mer peptides for the in silico prediction of the SARS-CoV-2 main protease cleavage sites in this study.
In addition to 273 uncleavable glutamines (Q), the rest 5071 glutamines (hence, 5071 9-mer uncleavable peptides) were not abandoned. After redundancy clearance, these 5071 uncleavable peptides were reduced to 2360 nonredundant uncleavable peptides. These 2360 non-redundant uncleavable peptides were saved for the blind test of the constructed models. In theory, all of these 2360 blind uncleavable peptides were expected to be classified as the uncleavable ones using a pattern recognition model if it was well constructed. Table 1 shows these 116 cleaved peptides and the proteins as well as the cleavage sites. Figure 1 shows the sequence logo for the 116 9-mer cleaved peptides. Figure S1 shows the sequence logo for the 259 9-mer uncleavable peptides. Comparing these two sequence logos, it can be seen that the uncleavable peptides had no trend of the amino acid composition. However, the cleaved peptides displayed a distinct trend of the amino acid composition. For instance, the residue P 2 (labeled by 4 in Figure 1) was mainly occupied by the amino acid leucine (L) in addition to valine (V), methionine (M), and isoleucine (I). The reside P 0 1 (labeled by 5 in Figure 1 Protein and sites TVRLHAGSAT R1A_BCHK9#4094 serine (S) in addition to alanine (A), asparagine (N), and glycine (G).
Therefore, it is expected that two types of peptides (cleaved vs. uncleavable) should not be very difficult to separate in this data set.
Based on the comparison between Figure 1 and Figure S1, it can be seen that residues P 5 , P 0 3 , P 0 4 , and P 0 5 (labeled by 1, 7, 8, and 9 in Figure 1) may not have a significant contribution to the discrimination between the cleaved and uncleavable peptides. Therefore, Protein and sites RSIMQSTDMA R1A_PEDV7#4100 Note: The # key is used to separate between a protein and a cleavage site. Multiple protein sequences may contain an identical peptide. For instance, the peptide SAVLQSGFRK was found in four protein sequences (R1AB_SARS2, R1A_SARS2, R1AB_SARS, R1A_SARS).
The sequence logo of 116 cleaved peptides, where the integers from 1 to 9 represent the residues, Note that the glutamine (Q) has been omitted another data set used for the SARS-CoV-2 main protease cleavage site prediction in this study was based on the peptide structure of five residues, that is, P 4 P 3 P 2 P 0 1 P 0 2 . After reducing the peptide size from 9 to 5, the redundancy among peptides was checked again. This led to 87 non-redundant cleaved 5-mer peptides, 256 non-redundant uncleavable 5-mer peptides and 2061 non-redundant blind uncleavable 5-mer peptides. Table 2 summarizes the number of   peptides. A pattern recognition model, which is a classifier in this case, can thus be constructed to examine the discriminative power for either data set, that is, the 9-mer peptide set and the 5-mer set for the purpose of the in silico prediction of the SARS-CoV-2 main protease cleavage sites. To construct a classifier, three issues were considered. The first issue was how to present (encode) the amino acids into a model. This is because most pattern recognition algorithms only accept numerical data. Different methods of dealing non-numerical amino acids have different efficiency and reliability.
The binary-encoding, descriptors, profiling and the bio-kernel function approaches were considered in this study. Although there are many others, these have been the most popularly used in the literature. The second issue was the selection of the pattern recognition approaches. There is normally no rule-of-thumb for determining which is superior in advance and there is a need for careful examination of each. Seven most popularly used and representative as well as matured pattern recognition approaches have been employed in this study. The third issue was how to evaluate such a model when it has been constructed. The cross-validation method as well as the ROC analysis approach was employed in this study. Refer to the extended methods for details of these three methods.
The kernel function has been well exercised in the machinelearning field. 28,29 A naïve description of the kernel function is briefly described here. One of the most promising advantages of the kernel function is that it can transform a nonlinearly separable space to a linearly separable space. For instance, two classes of data points in the original space (A, B, α, and β) in the left panel of Figure 2 are nonlinearly separable. When using two data points (α and β) as the kernels, the distances between four data points and these two kernels can be calculated. Based on the distances, a new space is formulated shown in the right panel of Figure 2. It can be seen that these four data points in this new kernel space coordinated by α and β become linearly separable.

| RESULTS AND DISCUSSION
The first question for any pattern recognition model is whether the variables themselves possess a good discriminative power. Therefore,   Figure 3. The number of cells occupied by at least one peptide was 154, that is, the occupancy rate was 68.89%. The purity rate of the cells was 91.47%, which was lower than 94.13%. Figure S4A shows the SOM model constructed based on the descriptor-encoded data. The number of cells occupied by at least one peptide was 148, that is, the occupancy rate was 65.78%. The purity rate of the cells was 75.47%, which was much lower than 94.13%. Figure S4B shows the SOM model constructed based on the profile-encoded data. The number of cells occupied by at least one peptide was 141, that is, the occupancy rate was 62.67%. The purity rate of the cells was 78.67%, which was also much lower than 94.13%.
It can be seen that the bio-SOM model maximally explored the discriminative power from this SARS-CoV-2 protease peptide data, being 94.13%. Therefore, 94.13% should be considered as the benchmark when evaluating the supervised pattern recognition models constructed for the in silico prediction of the SARS-COV-2 main protease cleavage sites. Table 2 shows the performance of two sets of 23 supervised pattern recognition models. For the 9-mer peptides data, eight models had the total prediction accuracy over 94.13%. Therefore, the discriminative power was well explored in these supervised models. All the profile-encoded models were not able to achieve the total prediction accuracy greater than 94.13%.
Five models had no Type I error on the blind data. All were the biokernel models. Four best models had the total prediction accuracy greater than 94.71% and had no Type I error. They were bio-FOREST, bio-MLP, bio-SVM, and bio-RVM. Among them, the bio-SVM model was the best.
Its AUC was 1, its MCC was 0.9938, and its total prediction accuracy was 99.73%. The total prediction accuracy of bio-SVM was about 5% greater than that of bio-SOM, which was a significant increase.
For the 5-mer peptides data, eight models had the total prediction accuracy greater than 94.13% and five models had no Type I error on the blind data. The best 5-mer model was the bio-SVM model. Its AUC F I G U R E 2 A naïve description of the kernel function approach. The left panel shows the original data space coordinated by x and y, in which four data points are labeled by A, B, α, and β. A and B belong to one class while α and β belong to the other class. They are nonlinearly separable because it is impossible to separate these two classes using one straight line. Suppose α and β are selected as the kernels. The distances between four data points and two kernels are calculated. The right panel shows the distribution of four data points based on four sets of distances using the kernel function. In this new space, two coordinates are no longer x and y, but α and β. It can be seen that this new space of four data points becomes linearly separable The bio-SOM map of 225 neurons constructed for the 9-mer peptides data. "N" stands for the uncleavable peptides and "C" stands for the cleaved peptides. One circle stands for one neuron or one cell. The printed letter in a cell, which is either N or C, stands for a peptide, which has been mapped to the cell. For instance, two cleaved peptides were mapped to the first cell at the bottom row while one uncleavable peptide was mapped to the third cell at the bottom row. These two cells were pure for one class. However, the second cell at the top row contained two cleaved peptides and one uncleavable peptide. Thus, this cell was not pure for one class was 0.9999, its MCC was 0.9770, and its total prediction accuracy was 99.42%, which was 5% greater than the benchmark accuracy 94.13%.
Comparing all the models, it can be seen that the bio-kernel models (bio-FOREST, bio-MLP, bio-SVM, and bio-RVM) performed the best for the SARS-CoV-2 main protease cleavage site prediction.
Other models either failed to have the total prediction accuracy greater than 94.13% or failed to have 0% Type I error rate in the blind data set testing. Figure S5 shows the ROC curves of the bio-SVM models. They were consistent with the figures included in Table 3. Figure S6 shows the densities estimated for the predictions on the blind data using the bio-SVM models. It can be seen that the prediction values were all smaller than the threshold 0.5, which was the default threshold when prediction values were between 0 and 1 for the discrimination between two classes of peptides. This means all of uncleavable glutamine residues in the blind test data set were accurately predicted as uncleavable ones. Figure 4 shows the prediction spectra of four bio-kernel models for the protein R1AB_SARS2. All demonstrated the excellent discriminative power between the cleaved glutamines and uncleavable glutamines. The bio-RVM model was outstanding because the prediction values of all the uncleavable glutamines were almost zero.
As indicated in the earlier studies 30 32 and to discover the tryptic cleavage pattern. 12 However, in these applications, the data used in a decision tree model were the peptide residues, that is, one residue was one variable. For instance, a resulting decision tree model based on the 5-mer SARS-CoV-2 main protease cleavage peptides data in this study can thus explain which of the 5 residues (either P 4 or P 3 or P 2 or P 0 1 or P 0 2 ) play an important discriminatory role that separates the cleaved glutamine residues from the uncleavable glutamine residues for the coronavirus main protease.
T A B L E 3 The model performance. "Type I" stands for the Type I error rate Note: "NO" means no encoding process was used. "BIN" stands for the binary-encoded data. "DES" stands for the descriptor-encoded data. "PSE" stands for the profile-encoded data. The percentages in bold were greater than 94.13% of the bio-SOM model.
Rather than using the raw residues as the variables, a bio-kernel inductive pattern recognition model employs the cleaved peptides as the variables. Thus, a bio-kernel inductive pattern recognition model (bio-C5.0 and bio-FOREST) was able to discover which cleaved peptides were the most significant ones for the discrimination between the cleaved peptides and the uncleavable peptides. These most discriminating cleaved peptides were then the most probable prototypes as the targets for the drug (inhibitors) design. 6,33 The bio-C5.0 model and the bio-FOREST model constructed for the 9-mer peptides are shown in Figures S7 and S8, respectively. The bio-C5.0 model employed nine cleaved peptides and the most important cleaved peptide was GVNLGSGKTT. The bio-FOREST tree employed 11 cleaved peptides and the most important cleaved peptide was DTTVGSKDTN. Among them, eight were significant because their p values were less than .01. The decision tree algorithm partitions a space using the orthogonal partitioning rules while the random forest algorithm partitions a space using the non-orthogonal partitioning rules. Therefore, the resulting bio-C5.0 and bio-FOREST models were different. Figures S9 and S10 show the sequence logos of the cleaved peptides employed by these two trees. It can be been that most cleaved peptides selected in the trees had leucine (L) in the residue P 2 and S (serine) in the residue P 0 1 . Compared with Figure 1, these sequence logos show clear amino acid composition trends. The cleaved peptides selected by these tree models represent their importance (measured by the p values) to discriminate between the cleaved peptides and the uncleavable peptides. In terms of the use of the biokernel technique, such a selected cleaved peptide has the following property. The majority of the cleaved peptides have high alignment scores with this selected cleaved peptide while the majority of the uncleavable peptides have low alignment scores with this selected cleaved peptide. Therefore, such a selected cleaved peptide is very different from uncleavable peptides and these two selected cleaved peptides are most different from uncleavable peptides compared with other cleaved peptides.
The consensus peptide of the bio-C5.0 model ( Figure S7) was ÀV À LS À À À À and the consensus peptide of the bio-FOREST model ( Figure S8) was À À ÀLS À À À À. The latter was identical with the consensus sequence derived from all cleaved peptides. This means that the main protease cleavage rule can be simplified to either ÀV À LSG # Q À À À À or À À ÀLSG # Q À À À À, where # stands for the cleavage site. If only checking whether the consensus peptide was present in the peptides, these two consensus peptides can be used to scan all the peptides to examine whether they matched. Table S2 shows the confusion matrices. It can be seen that although the consensus peptide À À ÀLS À À À À was far less than perfect, it outperformed the consensus peptide ÀV À LS À À À À. This is not a surprise because the residue P 4 was not mainly occupied by the amino acid valine (V). Figure S11 shows how the 20 amino acids were distributed at the residue P 4 among the SARS-CoV-2 main protease cleaved peptides. The amino acid valine (V) had the largest frequency (29%) at this residue, the frequency of the serine (S) was 23%, and the frequency of the threonine (T) was 18%. The difference between three frequency values was actually not insignificant.
To further validate the discriminative power of the consensus peptides, the mutation matrix BLOSUM62 was used to calculate the homology alignment scores between the consensus peptides and all the peptides. The calculation method is included in the Supporting Information Document S7. Figure S12 shows the estimated densities of the homology alignment scores. It can be seen that the density of the uncleavable and the density of the cleaved peptides had a higher degree of overlap using the consensus peptide ÀV À LS À À À À. The overlap degree was smaller when using the consensus peptide À À ÀLS À À À À. Therefore, the bio-FOREST model may be able to deliver more robust decision-making rules for this data set.
In addition to identifying the consensus peptides, another question was whether the cleaved peptides can be ranked in terms of their discriminative power. A random forest model can rank the variables.
In the bio-kernel space, the variables were the cleaved peptides.
Therefore, the bio-FOREST model can rank the cleaved peptides in terms of their discriminative power between the cleaved and uncleavable peptides. The increase in node purity measurement of the bio-FOREST model was used to rank the cleaved peptides.
Figures S13 and S14 show the amino acid distribution trend of the top 10 cleaved peptides selected by the bio-FOREST model. Again, the residue P 2 was always occupied by the amino acid leucine (L) and the residue P 0 1 was almost occupied by the amino acid serine (S). This was again consistent with what has been discussed above.
The next issue was whether the top-ranked cleaved peptides can reserve a good discriminative power after the cleaved peptides were ranked. If so, later drug (inhibitor) design can have a parsimonious F I G U R E 4 The prediction spectra of four bio-kernel models (bio-FOREST, bio-MLP, bio-SVM, and bio-RVM) for the protein R1AB_SARS2. The protein had seven main protease cleavage sites. The heights of the bars stand for the predicted values which have been normalized between 0 and 1. The bars with the dots on the top stand for the true cleavage sites structure to investigate. Based on the top 10 cleaved peptides selected by the bio-FOREST model, the parsimonious models were constructed. In a parsimonious model, only top 10 cleaved peptides were used as the kernel peptides. These 10 cleaved peptides were used for the following analysis. Four models which demonstrated the best performance shown in Table 3 were re-constructed to examine whether these parsimonious models had the performance significantly decreased or not. These four models were bio-FOREST, bio-MLP, bio-SVM, and bio-RVM. Table 4 shows the result. It can be seen that the performance was indeed worse, but the difference between the full models and the parsimonious models was insignificant. The decreased accuracy was not a surprise because the rest of the unused cleaved peptides may carry some extra discriminative power though minor. increased from 0% to 2%. The parsimonious bio-MLP and bio-RVM models also had a decrease in accuracy. Therefore, a parsimonious model sacrificed the accuracy slightly. However, using less than 10% variables, the decrease in the prediction accuracy was insignificant.
This study has shown that the SARS-CoV-2 main protease cleavage pattern has been well-reserved in peptides. For the 9-mer peptide data, only one model had the AUC value below 0.9 and only five models had the total prediction accuracy below 90%. For the 5-mer peptide data, no model had the AUC value below 0.9 and six models had the total prediction accuracy below 90%. Importantly, the bio-SOM model demonstrated 94.13% total prediction accuracy meaning that the amino acid composition trend or pattern inherent in the cleaved peptides was significant in terms of the discriminative power between the cleaved peptides and the uncleavable peptides. The use of a supervised model further explored the discriminative power when the model was able to capture the complexity within the peptide data.
Based on the above analysis of the in silico analysis results, it can be seen that the pattern recognition models incorporating the biokernel function outperformed other models which employed various amino acid encoding approaches. The reason may be due to the use of the amino acid mutation matrix which can make the mutual relationship between peptides more biologically sound. Mapping a difficult-to-model space to a kernel space for efficient data modeling including regression analysis and classification analysis has been well exercised in the pattern recognition area. Mapping a non-numerical peptide space to a numerical bio-kernel space has two benefits. First, the difficulty of handling non-numerical peptides is eased. Second, which is more important, discovering the most probable prototypes for SARS-CoV-2 drug (inhibitor) design can benefit. This feature may not be possible using any approach other than the bio-kernel models.

| CONCLUSION
Predicting protease cleavage sites in silico aims to generate a predictive model in a computer based on the known cleaved and uncleavable glutamine residues (peptides). Therefore, it is a typical pattern recognition problem. The basic assumption of modeling peptides for a protease cleavage problem using an in silico approach is that there should be sufficient known cleaved peptides verified in laboratory and the most importantly the cleaved peptides should well cover the amino acid composition trend for a specific protease to recognize. If there are insufficient known cleaved peptides or the available known cleaved peptides have not yet well covered the amino acid composition trend for a specific protease, efficiently predicting protease cleavage sites in silico would be impossible. The pattern recognition approaches have been well used for predicting protease cleavage sites in silico where the peptide data can well-satisfy the above two conditions. The benefits of using a pattern recognition approach for this kind of problem are obvious. First, some complex or nonlinear pattern can be well explored using a nonlinear pattern recognition model. For instance, in the 9-mer models shown in Table 3  Again, the former was a linear model and the latter was a nonlinear model. Second, a pattern recognition model can be used to deliver useful information for the drug or inhibitor design if the cleavage knowledge can be well discovered. It has been well recognized that the bio-kernel models can generate a predictive model with a better generalization capability in addition to biological sound content. The bio-kernel models used in this study have shown their powerfulness for predicting the SARS-CoV-2 main protease cleavage sites in silico.

PEER REVIEW
The peer review history for this article is available at https://publons.