A structure-based method was used to characterize the non-covalent interactions of HLA-A*0201 with its peptide ligands. In this procedure, protein and peptide atoms were classified into 16 types according to their chemical properties and local environments, and a 16 × 16 matrix was then defined to describe the interaction modes of the 256 atom-pair types between receptor and ligand in a complex structure. Three biologically relevant non-covalent potentials (electrostatic, van der Waals, and hydrophobic) were calculated separately for each element of the matrix, yielding 768 structural descriptors that encode detailed information about the non-covalent interactions involved in protein–peptide binding. We applied this method in a quantitative structure–activity relationship (QSAR) study of a data panel of 419 nonapeptides with known binding affinities to the HLA-A*0201 protein. Several QSAR models were constructed using partial least squares (PLS) regression with or without genetic algorithm (GA) variable selection, and these models were validated and examined systematically using an external test set and one-way analysis of variance. The results show that diverse properties contribute significantly to HLA-A*0201–peptide binding. In particular, the hydrophobicity and electrostatic properties of the peptide anchor residues confer significant specificity and stability on the bound complexes.
Major histocompatibility complex (MHC) molecules are highly polymorphic cell-surface molecules that present peptide ligands for inspection by T lymphocytes. MHC molecules are divided into three classes, of which the class I molecules (MHC I), complexed with short peptides, allow CD8+ T cells to scan the content of cells and recognize those presenting abnormal peptides, such as peptides derived from proteins encoded by microbial pathogens or from cancer antigens (1). Human MHC I molecules, also called human leukocyte antigens (HLA), are encoded by three separate but homologous genetic loci, HLA-A, HLA-B, and HLA-C (2). HLA-A*0201 is one of the most frequent class I alleles in many populations. For example, it is expressed in approximately 50% of Caucasians and has been shown to play a critical role in the presentation of viral antigens and of tumor antigens from a variety of cancers (3–5). Peptides that bind to HLA-A*0201 are usually 9 ± 1 amino acids long, and their primary anchor residues are those at positions 2 and 9 (P2 and P9) (6). In addition to P2 and P9, the residues at P1, P3, and P7, referred to as secondary anchor residues (7), also play important roles in binding. Although numerous experimental and theoretical studies have addressed the molecular mechanism underlying HLA-A*0201–peptide binding, the nature of the forces driving protein–peptide interactions is still not well understood.
Identification of the peptide fragments recognized by HLA-A*0201 is the first step toward understanding the structural basis of HLA-A*0201–peptide interactions. Although several experimental methods have been developed to quantitatively determine the binding affinity between HLA-A*0201 and peptide fragments (8), it is too time-consuming and expensive to synthesize all potential peptides found in pathogen and cancer proteomes and to assay each of them for HLA-A*0201 binding (9). Computational approaches that rapidly screen virtual peptide libraries have therefore been used as an alternative, and they now form an important branch of the newly emerged field of immunoinformatics (10). Early on, Pamer and Hill proposed a simple motif method to identify putative MHC-binding peptides from protein sequences (11,12). However, systematic studies demonstrated that a simple motif cannot provide sufficient information to accurately determine target peptide fragments (13). Kubo et al. extended the simple motif to non-anchor residues, which remarkably increased the rate of correct classification on a set of peptide candidates (14). Later, the application of new machine learning algorithms and bioinformatics techniques to the identification of epitope peptides from protein sequences significantly improved predictive accuracy and efficiency on unknown peptide sequences (15–17). In parallel, a number of computational methods were developed to quantitatively determine the binding affinity of MHC for peptide ligands. The independent binding of side chains (IBS) hypothesis and the weight coefficient matrix were the first attempts of this kind (18,19) and have been widely used in subsequent studies (20,21). More recently, quantitative structure–activity relationship (QSAR) analysis has been successfully applied to predict and explain the binding behavior of MHC–peptide complexes. For example, Doytchinova and co-workers employed three-dimensional (3D) structure descriptors and CoMFA/CoMSIA to construct 3D QSAR models for several sets of HLA-A*0201-restricted peptides with known activities (22,23). They further proposed a sequence-based approach, the additive method, to account for the contributions of individual residues and their interactions to binding (24). Tian et al. (25,26) developed amino acid descriptors to characterize the primary sequences of HLA-A*0201-binding peptides, and their results showed that peptide-binding affinity can be expressed, in part, as a linear sum of amino acid properties. Davies et al. proposed a strategy called static energetic analysis to characterize the interactions between HLA-A*0201 and its peptide ligands; in this method, the decomposed interaction energies were correlated with the experimentally determined affinities using chemometric approaches (27,28).
With the number of solved 3D structures of HLA-A*0201–peptide complexes growing rapidly in recent years, it has become possible to predict peptide-binding affinity using information derived from crystal structures. In this study, a structure-based method was used to characterize the non-covalent interactions of HLA-A*0201 with epitope peptides (29). In this method, the atoms at the protein–peptide interface are classified and paired to generate a 16 × 16 atom-pair matrix that describes the receptor–ligand interaction profile, and three biologically relevant non-covalent potentials are then evaluated for this matrix to generate 768 descriptors encoding detailed information about the non-covalent interactions involved in protein–peptide binding (29). Using a data set of 419 HLA-A*0201-binding peptides with known affinities, we combined this method with partial least squares (PLS) regression and genetic algorithm (GA) variable selection to develop several structure-based QSAR models for quantitatively predicting and interpreting the binding behavior of HLA-A*0201 with its peptide ligands. We also discuss the structural and physicochemical implications for HLA-A*0201–peptide binding.
Four hundred and seventy-three HLA-A*0201-restricted nonapeptides were collected from the AntiJen database (30). The binding affinity used here, IC50, is the peptide dose required to displace 50% of a radiolabeled standard peptide from its complex with HLA-A*0201 in a given time. To ensure that the IC50 values are comparable, only the 419 peptides measured against the standard peptide FLPSDYEPSV were selected from the initial 473. Of these, 138 are high-affinity peptides (IC50 ≤ 50 nM), 185 have intermediate affinities (50 nM < IC50 ≤ 500 nM), and 96 are low-affinity peptides (IC50 > 500 nM). For modeling, IC50 values (in nM) were converted into logarithmic activities, pIC50 = log(10^9/IC50), where the factor 10^9 ensures that the logarithmic activities range between 1 and 9. Tropsha et al. (31) highlighted the importance of external validation in QSAR studies, so the whole peptide set was split into two groups following a previous report (32): a training set of 300 peptides for constructing models and a test set of 119 peptides for assessing the predictive power of the constructed models. The sequences and corresponding affinities are listed in Table S1 in the Supporting Information.
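The conversion and binning described above can be sketched in a few lines. This is a minimal illustration, assuming that IC50 is expressed in nM and that the logarithm is base 10 (which is consistent with the 500 nM and 50 nM cut-offs corresponding to pIC50 values of 6.301 and 7.301); the function names are hypothetical and not part of the original workflow.

```python
import math


def pic50_from_ic50_nm(ic50_nm: float) -> float:
    """Convert an IC50 given in nM to pIC50 = log10(1e9 / IC50)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return math.log10(1e9 / ic50_nm)


def affinity_class(ic50_nm: float) -> str:
    """Bin a peptide with the 50 nM and 500 nM cut-offs quoted in the text."""
    if ic50_nm <= 50:
        return "high"
    if ic50_nm <= 500:
        return "intermediate"
    return "low"


if __name__ == "__main__":
    # Illustrative IC50 values only; these are not peptides from the data set.
    for ic50 in (5.0, 50.0, 500.0, 5000.0):
        print(ic50, round(pic50_from_ic50_nm(ic50), 3), affinity_class(ic50))
```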
We used a method modified from that of Hou et al. (33) to construct structural models of the HLA-A*0201–peptide complexes. Briefly, following the suggestion of Doytchinova et al. (22,23), the X-ray crystal structure of the viral nonapeptide TLTSCNTSV bound to HLA-A*0201 was used as the template complex (Figure 1), and the template peptide was mutated in silico to each sample peptide to generate the other HLA-A*0201–peptide complexes. Each complex was then energy-minimized using AMBER (34). The maximum number of minimization steps was set to 1000; the first 250 steps used the steepest descent algorithm and the remaining steps used the conjugate gradient algorithm. The solvent effect was treated with the generalized Born model.
Classification of protein atoms
Proteins contain five standard elements, namely C, H, N, O, and S, and these elements exhibit diverse chemical properties in different chemical environments and hybridization states. Different investigators have proposed different classification schemes for protein atoms to address specific problems. For instance, Lazaridis and Karplus (35) categorized atoms into 17 types according to solubility to study protein solvation free energy, whereas Tsai et al. (36) used 13 types according to size to analyze protein packing density and atomic volume. In this work, protein–peptide interactions are determined by diverse atomic properties (including hydrophobicity, electrostatic character, and size), so the atoms of the 20 amino acids were classified into 16 types according to their physicochemical properties and local environments (Table 1). Under this classification, there are at most 16 × 16 = 256 atom-pair types between the HLA-A*0201 receptor and the peptide ligand, which collectively encode detailed structural information about the interaction profile of HLA-A*0201 with the peptide.
Table 1. Classification of the atoms in proteins and peptides into 16 types (also refer to Supporting Information, Table S2)
H attached to formally positive atoms: Hζ of Lys; Hε, Hη1 and Hη2 of Arg; Hδ1, Hδ2, Hε1 and Hε2 of His
H attached to nitrogen, oxygen, and sulfur atoms: H of backbone; Hδ of Asn; Hε of Gln; Hγ of Ser; Hγ2 of Thr; Hη of Tyr; Hε1 of Trp; Hγ of Cys
H attached to aromatic carbon: Hγ, Hδ1, Hδ2, Hε1, Hε2 and Hζ of Phe and Tyr; Hδ1, Hε3, Hζ2, Hζ3 and Hη2 of Trp
H attached to aliphatic carbon: Hα of backbone; Hβ, Hγ1 and Hγ2 of Val; Hβ, Hγ, Hδ1 and Hδ2 of Leu; Hβ, Hγ1, Hγ2 and Hδ of Ile; Hβ, Hγ and Hε of Met; Hβ of Phe, Tyr, Trp, His, Cys, Asn, Asp, Ala and Ser; Hβ and Hγ of Glu and Gln; Hβ, Hγ, Hδ and Hε of Lys; Hβ, Hγ and Hδ of Arg; Hβ and Hγ1 of Thr; Hβ, Hγ and Hδ of Pro
C in amide: C of backbone; Cγ of Asn; Cδ of Gln
C in ionized group: Cζ of Arg; Cγ, Cδ2 and Cε1 of His; Cγ of Asp; Cδ of Glu
C in aromatic ring: Cγ, Cδ1, Cδ2, Cε1, Cε2 and Cζ of Phe and Tyr; Cγ, Cδ1, Cδ2, Cε2, Cε3, Cζ2, Cζ3 and Cη2 of Trp
C directly bonded to nitrogen, oxygen, and sulfur atoms: Cα of backbone; Cγ and Cε of Met; Cε of Lys; Cδ of Arg; Cβ of Ser, Thr and Cys
C in aliphatic group: Cβ, Cγ1 and Cγ2 of Val; Cβ, Cγ, Cδ1 and Cδ2 of Leu; Cβ, Cγ1, Cγ2 and Cδ of Ile; Cβ of Phe, Tyr, Trp, His, Asn, Asp, Ala and Met; Cβ and Cγ of Glu and Gln; Cβ, Cγ and Cδ of Lys; Cβ and Cγ of Arg; Cγ2 of Thr; Cβ, Cγ and Cδ of Pro
N in amide: N of backbone; Nδ2 of Asn; Nε2 of Gln
N in ionized group: Nζ of Lys; Nε, Nη1 and Nη2 of Arg; Nδ1 and Nε2 of His
N in aromatic ring: Nε1 of Trp
O in amide: O of backbone; Oδ1 of Asn; Oε1 of Gln
O in ionized group: Oδ1 and Oδ2 of Asp; Oε1 and Oε2 of Glu
O in hydroxyl group: Oγ of Ser; Oγ2 of Thr; Oη of Tyr
S: Sγ of Cys; Sδ of Met
It is well known that non-bonded effects play an essential role in various biological processes, such as molecular recognition in drug–receptor, antigen–antibody, and enzyme–substrate complexes. Non-covalent interactions between receptor and ligand consist primarily of electrostatic interactions, van der Waals contacts, and hydrophobic forces, while other interactions (e.g., hydrogen bonding, charge transfer, and salt bridges) can be regarded as special forms of these three types.
Most atoms in proteins are packed sufficiently close to each other to form transient van der Waals (vdW) attractions. The vdW packing energies at the protein–peptide interface were determined with the Lennard-Jones 12–6 potential, $E_{ij}^{\mathrm{vdW}} = \varepsilon_{ij}\left[(D_{ij}^{*}/d_{ij})^{12} - 2\,(D_{ij}^{*}/d_{ij})^{6}\right]$, where $d_{ij}$ is the distance from protein atom i to ligand atom j, $\varepsilon_{ij}$ is the Lennard-Jones well depth, and $D_{ij}^{*}$ is the distance at the Lennard-Jones minimum. The Lennard-Jones parameters for pairs of different atom types were obtained from the Lorentz–Berthelot combination rule (37), and the atomic parameters were taken from AMBER (34).
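As a rough sketch of the 12–6 term described above, the following assumes the AMBER-style form written with the well depth and the minimum-energy distance, together with the usual Lorentz–Berthelot rule (summed per-atom radii, geometric-mean well depths). The function names are hypothetical, and the numerical parameters in the usage lines are illustrative only; they are not taken from the paper or from any AMBER parameter file.

```python
import math


def lorentz_berthelot(r_star_i, eps_i, r_star_j, eps_j):
    """Combine per-atom parameters into pair parameters.

    Assumption: r_star is a per-atom vdW radius (half the minimum-energy
    distance of a like pair) and eps is the per-atom well depth.
    """
    d_star_ij = r_star_i + r_star_j       # Lorentz rule: additive sizes
    eps_ij = math.sqrt(eps_i * eps_j)     # Berthelot rule: geometric-mean depth
    return d_star_ij, eps_ij


def lennard_jones_12_6(d_ij, d_star_ij, eps_ij):
    """12-6 energy expressed with well depth eps_ij and minimum distance d_star_ij."""
    if d_ij <= 0:
        raise ValueError("distance must be positive")
    ratio6 = (d_star_ij / d_ij) ** 6
    return eps_ij * (ratio6 * ratio6 - 2.0 * ratio6)


if __name__ == "__main__":
    # Illustrative parameters (Angstrom, kcal/mol), not real force-field values.
    d_star, eps = lorentz_berthelot(1.9, 0.1, 1.5, 0.016)
    for d in (3.0, d_star, 4.5, 6.0):
        print(d, lennard_jones_12_6(d, d_star, eps))
```

Written this way, the energy equals minus the well depth exactly at the minimum-energy distance, which is a convenient sanity check on any parameter table.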
The electrostatic potential between two charged atoms can be accurately described with the classical Coulomb’s law, in which the potential is proportional to the product of the atomic charges and inversely proportional to the distance between them, $E_{ij}^{\mathrm{ele}} \propto q_i q_j / (\varepsilon_0 d_{ij})$, where $q_i$ is the partial charge of atom i and $\varepsilon_0$ is the distance-dependent dielectric constant. Here, the atomic charges defined in AMBER were used as the atomic partial charges (34).
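A pairwise Coulomb term of this kind might look as follows. The sketch assumes a linear distance-dependent dielectric, ε(d) = k·d, which is one common choice; the text does not state the functional form actually used. The conversion factor of about 332 kcal·Å/(mol·e²) and the function name are likewise assumptions made for illustration.

```python
# Approximate conversion factor so that charges in units of e and distances in
# Angstrom give energies in kcal/mol (assumed here, not quoted from the paper).
COULOMB_CONSTANT = 332.06


def coulomb_energy(q_i, q_j, d_ij, dielectric_scale=1.0):
    """Coulomb energy with an assumed linear dielectric eps(d) = scale * d."""
    if d_ij <= 0:
        raise ValueError("distance must be positive")
    eps = dielectric_scale * d_ij          # distance-dependent dielectric
    return COULOMB_CONSTANT * q_i * q_j / (eps * d_ij)


if __name__ == "__main__":
    # Opposite unit charges attract (negative energy); like charges repel.
    print(coulomb_energy(1.0, -1.0, 2.0))
    print(coulomb_energy(0.5, 0.5, 4.0))
```

With this choice the energy falls off as the inverse square of the distance, so long-range pairs contribute less than they would with a constant dielectric.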
Hydrophobic interaction is a central force driving protein folding and binding. The interatomic hydrophobic potential has been shown experimentally to decay exponentially with distance and to be proportional to the intrinsic hydrophobicity of the interacting atoms (38). An empirical equation (39) was used to quantitatively describe the hydrophobic potential in terms of ρ, Eisenberg’s atomic solvation parameter (40), and S, the atomic solvent-accessible surface area computed with the MSMS program (41) using ProtOr radii (36).
Although hydrogen bonds play an important role in specific biomolecular recognition, their dissociation energy is difficult to describe accurately with empirical approaches. In the AMBER force field, the van der Waals and Coulombic terms were optimized to jointly reproduce the hydrogen bond potential, and we used this parameter set to parameterize the electrostatic and steric interactions involved in HLA-A*0201–peptide binding. The hydrogen bond effect was therefore considered indirectly in this study. In addition, the rotatable side chains of interfacial amino acids are frozen upon HLA-A*0201–peptide binding, leading to a loss of conformational entropy in the system. However, the role of entropy in biological systems is quite elusive, and no effective methods are currently available to quantitatively measure its effect on biomolecular interactions. The entropic contribution was therefore ignored in this study, and we investigated only the direct interaction energies of the HLA-A*0201–peptide systems.
A detailed description of atomic types, partial charges, hydrophobicities, and vdW parameters for each atom of the 20 standard amino acids can be found in Table S2 in Supporting Information.
Partial least squares regression (PLS)
The PLS regression algorithm consists of two outer relations, $X = TP^{T} + F$ and $Y = UQ^{T} + E$, and an inner relation linking the two blocks. The matrices T and U contain mutually orthogonal latent variables (score vectors) t and u, which are linear combinations of the original variables in X and Y, respectively. The t and u latent variables are correlated through the inner relation û = bt (where b is the regression coefficient), which leads to the estimation of y from x. A detailed description of the PLS algorithm can be found in Ref. (42).
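To make the outer and inner relations concrete, here is a minimal single-response PLS (PLS1) sketch using the NIPALS deflation scheme in plain Python. It is an illustration under stated assumptions, not the MATLAB implementation used in this work: the function names are hypothetical, the data are mean-centred but not scaled, and y plays the role of the single u vector so that b is the inner-relation coefficient.

```python
import math


def fit_pls1(X, y, n_components):
    """Fit single-response PLS by NIPALS; X is a list of rows, y a list of floats."""
    n, m = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(m)] for row in X]  # centred X residual
    f = [v - y_mean for v in y]                                # centred y residual
    weights, loadings, inner = [], [], []
    for _ in range(n_components):
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(m)]  # w ~ E^T f
        norm = math.sqrt(sum(v * v for v in w))
        if norm < 1e-12:      # no covariance left between X and y residuals
            break
        w = [v / norm for v in w]
        t = [sum(E[i][j] * w[j] for j in range(m)) for i in range(n)]  # scores t = E w
        tt = sum(v * v for v in t)
        p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(m)]  # loadings
        b = sum(f[i] * t[i] for i in range(n)) / tt    # inner relation: u_hat = b t
        for i in range(n):                             # deflate both blocks
            for j in range(m):
                E[i][j] -= t[i] * p[j]
            f[i] -= b * t[i]
        weights.append(w)
        loadings.append(p)
        inner.append(b)
    return {"x_mean": x_mean, "y_mean": y_mean, "w": weights, "p": loadings, "b": inner}


def predict_pls1(model, x):
    """Predict y for one sample by replaying the deflation steps on its centred x."""
    e = [xj - mj for xj, mj in zip(x, model["x_mean"])]
    y_hat = model["y_mean"]
    for w, p, b in zip(model["w"], model["p"], model["b"]):
        t = sum(ej * wj for ej, wj in zip(e, w))
        y_hat += b * t
        e = [ej - t * pj for ej, pj in zip(e, p)]
    return y_hat
```

With as many components as the rank of X, this reproduces ordinary least squares; the benefit for 768 correlated descriptors comes from keeping only a few latent variables.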
The modeling procedure was performed as follows (Figure 2). (i) Definition of the HLA-A*0201–peptide binding interface. A residue of the HLA-A*0201 receptor and a residue of the peptide ligand were considered to be in contact if at least one pair of non-hydrogen atoms, one from each residue, lay within 6 Å of each other (43). The binding interface was then defined as the region comprising all contacting residues of the receptor and ligand. (ii) Construction of atom-pair interaction matrices. Three 16 × 16 atom-pair interaction matrices were constructed separately to describe the electrostatic, van der Waals, and hydrophobic interactions between HLA-A*0201 and the peptide in the interfacial region. Peptide atom types index the columns and HLA-A*0201 atom types index the rows. Using the aforementioned non-bonded potentials, a total of 3 × (16 × 16) = 768 elements (descriptors) in the three matrices were generated to characterize the non-covalent profile at the HLA-A*0201–peptide interface. (iii) PLS modeling. PLS was employed to linearly correlate these non-covalent descriptors with the corresponding binding affinities.
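Steps (i) and (ii) can be sketched as below. This is a simplified illustration rather than the C++ program used in this work: the atom records (dictionaries with `res`, `element`, `type`, and `xyz` keys), the function names, and the choice to sum the pair energies of all interface atom pairs falling into each type pair are assumptions made for the example.

```python
import math

N_TYPES = 16  # number of atom types in the classification


def interface_residues(protein_atoms, peptide_atoms, cutoff=6.0):
    """Residues with at least one heavy-atom pair within `cutoff` Angstrom."""
    prot_res, pep_res = set(), set()
    for a in protein_atoms:
        if a["element"] == "H":
            continue
        for b in peptide_atoms:
            if b["element"] == "H":
                continue
            if math.dist(a["xyz"], b["xyz"]) <= cutoff:
                prot_res.add(a["res"])
                pep_res.add(b["res"])
    return prot_res, pep_res


def atom_pair_matrix(protein_atoms, peptide_atoms, pair_energy, cutoff=6.0):
    """Accumulate pair energies into a 16 x 16 matrix (rows: HLA, columns: peptide)."""
    prot_res, pep_res = interface_residues(protein_atoms, peptide_atoms, cutoff)
    matrix = [[0.0] * N_TYPES for _ in range(N_TYPES)]
    for a in protein_atoms:
        if a["res"] not in prot_res:
            continue
        for b in peptide_atoms:
            if b["res"] not in pep_res:
                continue
            d = math.dist(a["xyz"], b["xyz"])
            matrix[a["type"]][b["type"]] += pair_energy(a, b, d)
    return matrix


def flatten_descriptors(matrices):
    """Concatenate several 16 x 16 matrices row by row into one descriptor vector."""
    return [value for m in matrices for row in m for value in row]
```

Calling `atom_pair_matrix` once per potential and passing the three results to `flatten_descriptors` gives the 768-element descriptor vector for one complex.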
Energy minimization was performed with the AMBER 9.0 package (44); non-covalent interactions between HLA-A*0201 and peptide ligands were calculated using an in-house program written in C++; and the MATLAB toolbox ZP-explore (45) was used to perform PLS and GA/PLS analysis.
Results and Discussion
Development and analysis of QSAR models
To investigate the potential relationship between different non-bonded effects and peptide-binding ability, we used the electrostatic, steric, and hydrophobic descriptors separately and in combination to construct QSAR models. The modeling results are listed in Table 2. The models constructed from a single non-covalent component (ME, MS, and MH) fail to capture the hidden dependences in the peptide system effectively and perform relatively poorly. As more non-covalent terms were included, the statistical quality of the models improved significantly. Among the three non-covalent effects, none was found to be more closely related to binding affinity than the others. This suggests that the binding of peptides to HLA-A*0201 is determined by diverse physicochemical contributions, none of which dominates the process.
Table 2. Statistics of the models constructed using different combinations of non-covalent types and modeling methods
b NLV, number of significant latent variables in PLS models.
d Assisted by GA-variable selection.
The optimal model, ME+S+H, was obtained by using all three non-covalent components; it explains a large proportion of the variance in binding affinity, with r² = 0.621 and RMSE = 0.54 on the training set. However, goodness of fit on the training set does not in itself indicate the predictive ability of the model. To address this, generalization ability was examined on the test set, giving r²_prd = 0.549 and RMSEP = 0.69. In Figure 3A, the fitted binding affinities of the training peptides agree well with the experimentally observed values, and only a few samples deviate substantially. On the test set, the binding propensity from low to high was predicted properly, confirming that the model ME+S+H is a reliable predictor for HLA-A*0201-binding peptides. Figure 4 shows the score scatter plot for part of the training peptides in the space of the first two principal components (PCs), in which most sample points fall within the Hotelling T² ellipse at the 95% confidence level, except for four outliers. By examining their sequence composition, we found that the abnormal behavior of these outlier peptides in PC space was caused by their particular structures: the first three peptides, VCMTVDSLV, RLLGSLNST, and SAICSVVRR, which appear in the bottom-left corner of the plot, each carry a hydrophilic amino acid at one anchor site (Cys at position 2 of VCMTVDSLV, Thr at position 9 of RLLGSLNST, and Arg at position 9 of SAICSVVRR); the fourth peptide, VLLLDVTPL, contains many strongly hydrophobic and bulky residues such as Leu, Pro, and Val and is thus located in the top-left corner, a region opposite to that of the first three peptides.
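The text does not spell out how the fit statistics were computed. Assuming the usual definitions (coefficient of determination as one minus the ratio of residual to total sum of squares, and root-mean-square error over all samples), they could be evaluated as in the sketch below; the function names are hypothetical. Applied to test-set predictions, the same functions would give the predictive counterparts of r² and RMSE.

```python
import math


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def rmse(observed, predicted):
    """Root-mean-square error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)
```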
Biomolecular interaction and recognition are usually governed by only a few sites or key residues, and different properties of these key residues contribute distinctly to binding, so variable selection is frequently used in QSAR studies to extract the most informative descriptors from the initial variable space (46). Because the genetic algorithm (GA) is a popular variable selection strategy that handles high-dimensional data well and has been applied successfully in many scientific fields (47), we combined PLS with GA variable selection to rebuild our models and improve their statistical quality and interpretability. The variable subset selected for the optimal GA/PLS model consists of 218 variables (73 electrostatic, 52 steric, and 93 hydrophobic terms), from which five significant latent variables were extracted to explain 60.6% of the variance of the dependent variable y (r² = 0.606) and to predict 57.6% of the variance of y in cross-validation (q² = 0.576). This model was further used to predict the pIC50 values of the test samples, explaining 57.7% of the variance of the test y (r²_prd = 0.577). Figure 3B shows the fitted versus experimental affinity values for the 300 training samples and the predicted versus experimental affinity values for the 119 test samples. Most sample points are distributed along the 45° line through the origin, and no obvious outliers were found in either the training or the test set. Figure 5 plots the observation distance to the model in the independent variable X space (DModX), which illustrates the distance between the positions rebuilt from the five significant latent variables and the original positions of the 300 training samples. With the five latent variables, most training samples can be rebuilt well in X space, because their normalized distance to the original X does not exceed the critical value of 1.251 at the 5% significance level. These results demonstrate that GA variable selection, although it did not enhance the fit of the PLS model on the training set, significantly improved modeling stability and predictive power, thus giving a more robust predictor for this peptide data set.
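The details of the GA used here (encoding, operators, fitness function) are not given in the text. The sketch below shows one generic way to select variables with a GA: each individual is a Boolean mask over the descriptors, parents are chosen by binary tournament, children are produced by uniform crossover and bit-flip mutation, and the best mask is always carried over. All names and parameter values are hypothetical; in a GA/PLS setting the `fitness` callable would presumably return a cross-validated statistic such as q² for a PLS model built on the masked descriptors.

```python
import random


def ga_select(n_vars, fitness, pop_size=20, generations=30,
              mutation_rate=0.02, seed=0):
    """Evolve a Boolean variable mask that maximizes `fitness(mask)`."""
    rng = random.Random(seed)
    population = [[rng.random() < 0.5 for _ in range(n_vars)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    history = [fitness(best)]                 # best fitness per generation
    for _ in range(generations):
        next_population = [best[:]]           # elitism: keep the current best mask
        while len(next_population) < pop_size:
            # Binary tournament selection for each parent.
            mother = max(rng.sample(population, 2), key=fitness)
            father = max(rng.sample(population, 2), key=fitness)
            # Uniform crossover followed by bit-flip mutation.
            child = [m if rng.random() < 0.5 else f
                     for m, f in zip(mother, father)]
            child = [(not gene) if rng.random() < mutation_rate else gene
                     for gene in child]
            next_population.append(child)
        population = next_population
        best = max(population, key=fitness)
        history.append(fitness(best))
    return best, history
```

Because the best mask is copied unchanged into each new generation, the recorded best fitness can never decrease, which makes the search easy to monitor.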
Variable importance in the projection and one-way analysis of variance
To interpret the GA-selected variable pool, the 218 variables were used to construct a PLS model of peptide affinity on the basis of all the 473 nonapeptide samples. The five variables with the highest variable importance in the projection (VIP) values were then singled out for further analysis (Table 3). The VIP is a straightforward measure of the relative importance of the original input variables in the resulting PLS model; a variable with a larger VIP value contributes more to the PLS model and is thus more strongly correlated with the binding affinity of the peptides (42). As might be expected, two of the five are hydrophobic terms, the most significant of which is the interaction of peptide aliphatic C with HLA aliphatic H. This is compatible with the notion that the hydrophobic force is the primary factor driving HLA–peptide binding (48). The electrostatic factor also appears to contribute to binding, because the polar interactions of peptide charged C/H with HLA polar C/hydroxyl O have relatively large VIP scores compared with other terms. Electrostatic complementarity between HLA and its bound peptides has been demonstrated to play an important role in the precise positioning of peptide ligands in the MHC cleft (49) and could therefore be regarded as a secondary factor relative to the hydrophobic force. The last of the five is a steric term, which makes a marginal contribution to binding through effects such as shape matching and interface packing, in particular between peptide positions P2, P3, and P9 and receptor residues Phe9, Met45, Val67, Tyr84, and Thr143 (50).
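The VIP formula itself is not given in the text. The definition commonly used with PLS, stated here as an assumption about what was computed, is shown below for a single-response model with m descriptors and A latent variables, where w_a is the weight vector, t_a the score vector, and b_a the inner-relation coefficient of component a.

```latex
\mathrm{VIP}_j \;=\; \sqrt{\, m \,
  \frac{\sum_{a=1}^{A} \mathrm{SSY}_a \,\bigl(w_{aj} / \lVert w_a \rVert\bigr)^{2}}
       {\sum_{a=1}^{A} \mathrm{SSY}_a} \,},
\qquad
\mathrm{SSY}_a \;=\; b_a^{2}\, t_a^{T} t_a
```

Under this definition the squared VIP values average to one over all descriptors, so values well above one mark descriptors that carry more than an average share of the explained variance in y.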
Table 3. The top five significant descriptors in the GA/PLS-selected variable pool and their variable importance in the projections (VIPs)
Description of the variable
Hydrophobic term of peptide aliphatic C with HLA aliphatic H
Electrostatic term of peptide charged C with HLA polar C
Hydrophobic term of peptide amide C with HLA aromatic C
Electrostatic interaction of peptide charged H with HLA hydroxyl O
Steric term of peptide aliphatic H with HLA amide N
To explore how each of the five significant descriptors varies with peptide binding, one-way analysis of variance was applied to these variables across three groups of peptides defined by pIC50 intervals [weak binding: pIC50 < 6.301 (IC50 > 500 nM); intermediate binding: 6.301 ≤ pIC50 < 7.301 (50 nM < IC50 ≤ 500 nM); strong binding: pIC50 ≥ 7.301 (IC50 ≤ 50 nM)]. The group sizes are 96, 185, and 138, respectively. A test of homogeneity of variances was carried out first. If the variances were homogeneous, Fisher’s analysis of variance was performed; otherwise, the Brown–Forsythe analysis of variance was used. The F statistics, homoscedasticity results, significance levels, degrees of freedom, and other statistics from the analysis of variance for the five significant descriptors are listed in Table 4. The results show that the between-group differences in these variables are all statistically significant, which further confirms that diverse non-covalent effects jointly contribute to peptide binding. Investigating how specificity differs among weak, intermediate, and strong binding peptides, and how large these differences are, would provide a valuable reference for understanding the binding behavior of peptides with HLA-A*0201 (51).
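For reference, the classical (Fisher) one-way F statistic used in the homogeneous-variance case can be computed as in the following sketch. The function name is hypothetical, and the Brown–Forsythe variant and the homogeneity test are not implemented here. For three groups of 96, 185, and 138 peptides, the degrees of freedom would be 2 between groups and 416 within groups.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of individual values around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within
```

A large F value indicates that the descriptor differs more between the affinity groups than within them; its significance is then read from the F distribution with the two degrees of freedom returned.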
Table 4. One-way analysis of variance of five most significant variables in the GA/PLS model
Test of homogeneity of variances
Robust tests of equality of means
a df1: Degree of freedom within groups.
b df2: Degree of freedom between groups.
c Significance level is 0.05.
Absence of non-covalent terms
We aim to encode the non-covalent interactions at the interfaces of HLA-A*0201–peptide complexes and to use a statistical modeling approach to correlate the encoded information with the binding affinities of these complexes. An accurate description of the non-covalent terms that govern the binding of HLA-A*0201 to its peptide ligands is therefore fundamental to generating reliable QSAR models. In this study, however, only the electrostatic, van der Waals, and hydrophobic interactions are considered, whereas the hydrogen bond and desolvation effects are not represented explicitly, although they may be reflected to some extent in the combination of electrostatic, van der Waals, and hydrophobic terms. This omission may affect the results of the statistical modeling in ways that are not known. For example, it could increase the share of the models that is mere fitting and thus reduce their predictive power and interpretability. However, such side effects are difficult to detect directly with current QSAR methodology, because (i) one can never say that a model that performs well on a limited set of samples will also be reliable when used to predict all new samples, and (ii) one can never say that a method that performs well on a given set of samples will also be feasible for all other problems. In this respect, rather than constructing a model to accurately predict HLA-A*0201–peptide binding affinity, we intend here to provide an ‘idea’ for the structure-based QSAR analysis of protein–peptide complexes; this ‘idea’ could be improved in the future for practical purposes and could also shed light on the development of other QSAR methodologies.
A structure-based method was used to encode the non-covalent interaction profile at the binding interfaces of 419 HLA-A*0201–peptide complexes. In this procedure, atom pairs between the receptor and ligand were classified into 256 types, and three non-bonded effects were then calculated separately for each atom-pair type, yielding 768 descriptors. Subsequently, QSAR models were constructed using PLS with or without GA variable selection. These models were then validated and examined systematically using an external test set and one-way analysis of variance. The results show that electrostatic, van der Waals, and hydrophobic effects jointly contribute to HLA-A*0201–peptide binding. In particular, diverse interactions between C and H atoms are the dominant contributors to binding.
This work was supported by the National Natural Science Foundation of China (Nos. C03030204 and 30770968).