Local co-occurrence scores of amino acid pairs
Although CoS can provide comprehensive information about amino acid pair preferences, the number of combinations is large (i.e., 400 amino acid pairs times the number of possible separations), and thus it is difficult to obtain general trends about amino acid sequences from these scores. To decrease the number of combinations and to focus on the local sequence preference, we averaged the CoS values over residue separations of 1–10 and defined this average as the local CoS (LCoS).
Positive values of LCoS indicate that the two specified residues are abundant in the local proximity up to 10 residue separations. We will first discuss the abundance and the absence of the 400 residue pairs by using LCoS, and then we will observe the preferences at each separation, by using CoS for several notable pairs or groups of pairs.
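The averaging that yields LCoS can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `local_cos`, the dictionary layout (separation mapped to CoS for one ordered pair), and the numbers are hypothetical, and the CoS values themselves are taken as given rather than computed.

```python
from statistics import mean

def local_cos(cos_by_separation, max_separation=10):
    """Average the CoS of one ordered residue pair over separations 1..max_separation."""
    return mean(cos_by_separation[d] for d in range(1, max_separation + 1))

# Made-up CoS values for a single ordered pair, keyed by residue separation.
example_cos = {d: 0.01 * d for d in range(1, 11)}
lcos = local_cos(example_cos)
```

A positive result corresponds to a pair that is, on average, enriched within 10 residues; a negative result corresponds to a depleted pair.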
Figure 1 illustrates the LCoSs for 400 amino acid pairs. In this figure, we roughly divided the pairs into three groups: hydrophobic–hydrophobic (HH, left), polar–polar (PP, middle), and hydrophobic–polar (HP, right), under the assumption that the following nine residues (A, C, M, I, L, V, F, Y, and W) were hydrophobic, and the others were polar. For a more detailed discussion, the 20 amino acids were color-coded according to their physicochemical properties: the nine hydrophobic residues are colored green, the positively charged residues (H, K, and R) are blue, the negatively charged ones (D and E) are red, G and P are cyan, and the noncharged polar residues (S, T, N, and Q) are magenta.
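The rough three-way grouping can be checked by enumeration. The sketch below assumes that the 400 pairs are ordered pairs (so that, e.g., EK and KE are distinct); the names `pair_group` and `group_sizes` are illustrative only.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HYDROPHOBIC = set("ACMILVFYW")  # the nine residues treated as hydrophobic in the text

def pair_group(first, second):
    """Return 'HH', 'PP', or 'HP' for an ordered residue pair."""
    n_hydrophobic = (first in HYDROPHOBIC) + (second in HYDROPHOBIC)
    return {2: "HH", 0: "PP", 1: "HP"}[n_hydrophobic]

# Count how many of the 20 x 20 ordered pairs fall into each group.
group_sizes = {"HH": 0, "PP": 0, "HP": 0}
for first, second in product(AMINO_ACIDS, repeat=2):
    group_sizes[pair_group(first, second)] += 1
```

With nine hydrophobic and eleven polar residues, this gives 9 × 9 = 81 HH pairs, 11 × 11 = 121 PP pairs, and 2 × 9 × 11 = 198 HP pairs.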
Figure 1. Local co-occurrence scores (LCoS) of the 400 amino acid pairs. The pairs are evenly divided into 40 columns on the x-axis and arranged such that the HH, PP and HP pairs are sorted from left to right. The y-axis indicates the LCoS value of each pair. The amino acids are color coded, as described in the text. The identical residue pairs are highlighted with arrows, and the regions enclosed by dashed or dotted lines are discussed in the text.
The figure revealed the general trend that the HH pairs and the PP pairs occur less frequently than expected (i.e., negative LCoSs), whereas most of the local HP pairs are found more frequently than expected (positive LCoSs). The means and standard deviations of the LCoS values of the HH, PP, and HP pairs were −0.033 ± 0.030, −0.017 ± 0.028, and +0.019 ± 0.029, respectively, and the three groups showed significantly different distributions from each other (P < 0.01 by Wilcoxon test for all combinations of distributions). When we simulated a data set of random protein sequences by randomly permuting the residues within each protein and calculated the LCoS value for each amino acid pair from the randomized data set, the means and standard deviations of the LCoSs of the HH, PP, and HP pairs were much smaller, −0.0004 ± 0.0006, −0.0002 ± 0.0003, and −0.0002 ± 0.0002, respectively, indicating that most of the observed LCoS values are significant in the native protein sequences. These results indicated that, during the course of evolution, the mixing of hydrophobic and polar residues was favored in the amino acid sequences of the “Ordered” proteins, whereas the clustering of residues from the same group was avoided.
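The randomized control can be illustrated as follows. This is a minimal sketch, assuming that the permutation is applied independently within each protein so that length and composition are preserved; the function name, the fixed seed, and the toy sequences are hypothetical and not taken from the original analysis.

```python
import random

def shuffle_within_proteins(sequences, seed=0):
    """Return copies of the sequences with residues randomly permuted within each one."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    shuffled = []
    for seq in sequences:
        residues = list(seq)
        rng.shuffle(residues)  # in-place permutation; composition is unchanged
        shuffled.append("".join(residues))
    return shuffled

# Toy sequences, invented for illustration.
toy_sequences = ["MKVLAEELLKK", "GSGTPEDRHHC"]
randomized = shuffle_within_proteins(toy_sequences)
```

Recomputing LCoS on such a randomized set removes any ordering signal while keeping each protein's amino acid composition, which is why it serves as a baseline for the observed values.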
The suppression of the local occurrences of HH pairs may imply a strategy for preventing protein aggregation. Since protein aggregation is related to various pathological conditions, sequence mutations that promote aggregation would be disadvantageous for survival and would thus be suppressed during evolution.[27, 28] For example, amino acid sequences reportedly have safeguards against aggregation: residues that prevent aggregation, such as arginine, lysine, and proline, are placed at the sites flanking long stretches of hydrophobic residues.[29, 30]
The contribution of each hydrophobic residue to the decrease in local HH pairs was estimated by comparing the distributions of the LCoSs of those HH pairs that include at least one residue of the corresponding type. Since we defined nine residues as hydrophobic, each residue was included in 17 HH pairs, consisting of 8 × 2 = 16 heterogeneous pairs and one identical pair, and the median LCoSs of these 17 pairs were compared. As a result, isoleucine (−0.057), phenylalanine (−0.052), tryptophan (−0.052), and leucine (−0.045) strongly contributed to the decrease in HH pairs, indicating that strong hydrophobicity is responsible for the decrease in HH pairs in local proximity.
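The count of 17 pairs per hydrophobic residue, and the per-residue median, can be sketched as below. The LCoS table used here is a placeholder with a constant made-up value, and the function names are hypothetical; only the pair enumeration follows directly from the text.

```python
from statistics import median

HYDROPHOBIC = "ACMILVFYW"

def hh_pairs_containing(residue):
    """Ordered HH pairs in which the given residue appears at least once."""
    return [(a, b) for a in HYDROPHOBIC for b in HYDROPHOBIC if residue in (a, b)]

def median_lcos_for(residue, lcos):
    """Median LCoS over the HH pairs that contain the residue."""
    return median(lcos[pair] for pair in hh_pairs_containing(residue))

# Placeholder LCoS table (constant, invented value) for all 81 ordered HH pairs.
placeholder_lcos = {(a, b): -0.05 for a in HYDROPHOBIC for b in HYDROPHOBIC}
pairs_with_ile = hh_pairs_containing("I")
median_ile = median_lcos_for("I", placeholder_lcos)
```

Each residue pairs with the eight other hydrophobic residues in two orders (16 pairs) and with itself once, giving the 17 pairs over which the median is taken.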
In a similar way, the decreased occurrence of local PP pairs represents an important feature of proteins with stable structures, when we focus on the consistency with the sequence features of intrinsically disordered proteins. The disordered regions are characterized by a low content of hydrophobic residues (or a high content of polar and charged residues),[32, 33] and low complexity due to highly repeated residues.[34, 35] These features are consistent with our observation of the suppressed local clustering of polar and charged residues in structured proteins.
The increased local proximity of HP pairs in the amino acid sequences may simply compensate for the suppressed HH and PP pairs, but protein structures may also have local features that promote the mixing of hydrophobic and polar amino acids. For example, in small globular proteins, regular secondary structures, such as helices and strands, have hydrophobic and polar residues that appear periodically in turn. Another example is the supersecondary structure, in which regular secondary structure elements (SSEs) with hydrophobic residues are followed by a loop region with polar residues.
Structural backgrounds of frequent HH and PP pairs
Despite the general trend of negative LCoS values for HH and PP pairs, some had positive LCoSs. As shown in Figure 1, most of the HH pairs with positive LCoSs contained alanine or cysteine. These residues have relatively low hydrophobicity, as compared with the other hydrophobic residues, which may explain their different trends from the other typical HH pairs. On the other hand, the PP pairs with positive LCoSs included various types of residues. The sequence–structure relationship of these exceptional PP pairs is interesting, because they may play important roles in the stability of protein structures.
First, the histidine–histidine pair (Column 8, LCoS 0.066) is the most favored among the PP pairs. Histidine pairs are reportedly included in many local motifs, such as zinc fingers, and their important roles in functional sites explain their exceptional abundance in protein structures.
Among the pairs between the remaining four charged residues (D, E, R, and K), the eight pairs of opposing charges were found significantly more frequently than those of the same charge (Fig. 1, Columns 8–11 and 14, dashed rectangles for different charge pairs and dotted ones for the same charge pairs, P < 0.01 by Wilcoxon test). This observation is physically reasonable, because oppositely charged pairs are electrostatically favorable. However, it is quite interesting that the oppositely charged pairs had directionality in protein structures. In other words, the negatively charged residue (D or E) frequently occurred on the N-terminal side of the pair, whereas the positively charged one (R or K) was on the C-terminal side (Fig. 1, Columns 9–11, dashed rectangles; for example, see the difference between EK and KE). A close inspection of these residue pairs, using CoSs from 1 to 10 sequence separations, revealed that these pairs frequently appeared at separations of 1, 3, 4, and 7 [Fig. 2(A)], which suggested the periodicity of α-helices. In α-helices, the charged residues reportedly have a distinctly asymmetric distribution, due to the interaction between the charged side chains and the helix dipole moments; glutamic acid and aspartic acid are favored near the N-termini to interact with the main-chain nitrogens, whereas lysines and arginines are favored near the C-termini to interact with the main-chain oxygens.[9, 10] Thus, α-helices may impose an ordered arrangement of charged residues on protein sequences.
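The directional comparison (e.g., EK versus KE) rests on counting ordered residue pairs at a fixed separation. A minimal sketch of such counting is given here, assuming that a separation of d refers to sequence positions i and i + d; the function name and the toy sequence are hypothetical, and the normalization that converts counts into CoS values is not reproduced.

```python
from collections import Counter

def count_ordered_pairs(sequence, separation):
    """Count ordered pairs (residue at i, residue at i + separation) in one sequence."""
    return Counter(zip(sequence, sequence[separation:]))

# Toy helix-like fragment, invented for illustration: E precedes K by four residues.
counts = count_ordered_pairs("EAAAK", 4)
ek_count = counts[("E", "K")]
ke_count = counts[("K", "E")]
```

Because the pair is ordered, the same fragment contributes to EK but not to KE, which is what allows the two directions to receive different scores.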
Figure 2. The co-occurrence propensities for (A) EK, KR, and their counterparts, (B) PE and EP, and (C) GS, GT, and their counterparts. The amino acid pairs in the opposing directions (e.g., EK and KE) are depicted by a solid line (the more frequent one) and a dashed line (the less frequent one) with the same color.
We also found other types of pairs with a directional preference; that is, PE (Column 13, dashed circle) had a positive LCoS of 0.023, whereas EP (Column 12, dashed circle) had a large negative LCoS of −0.102. The analysis of the CoS values at each separation revealed that PE was more favored than EP at every separation from 1 to 10, with the largest difference occurring at a separation of 1 [Fig. 2(B)]. This difference may be explained in terms of the terminal motifs of an α-helix. Although proline cannot form regular backbone hydrogen bonds in helices, it is strongly favored as either the first residue of the helix N-terminus or the residue just after the C-terminus, whereas glutamic acid is favored inside an α-helix near the N-terminus, as described above. Thus, proline was frequently followed by glutamic acid, but the inverse pair was observed less often, as a result of the combination of the two known sequence preferences.
GT and GS are another example of frequent PP pairs (Fig. 1, Column 17, dashed circle). They are much more favored than their opposing counterparts, SG (Column 18) and TG (Column 20). The analysis of these pairs using CoSs revealed that for most of the separations from 2 to 10, GT and GS occurred more frequently than TG and SG [Fig. 2(C)]. When secondary structures were assigned to these pairs, the majority of these residue pairs were involved in loop regions. Thus, the GT and GS pairs may play an important role in the stability of loop regions, as compared to their opposing counterparts, although their atomic details have not been clarified.
Suppression of identical pairs in ordered proteins
In the data set of soluble, ordered proteins, 12 of the 20 identical pairs had negative LCoSs, with a median LCoS value of −0.021. The suppression of the identical pairs, however, may not necessarily be observed in protein sequences that are not guaranteed to fold into stable structures.
To address this point, we constructed four additional protein sequence data sets, for comparison to the “Ordered” sequences. (1) The “Disordered” data set was obtained from the DisProt database. It consists of 653 protein sequences that were demonstrated to be intrinsically disordered; that is, not to form stable structures. (2) The “UniRef” data set is a nonredundant subset from the UniProt database. It consists of 3,939,736 nonredundant protein sequences, any two of which share at most 50% sequence identity. Data sets (3) “Human” and (4) “Escherichia coli” include 34,421 and 4190 protein sequences of each species, respectively, which were obtained from the RefSeq database.
Figure 3 illustrates the boxplots of the LCoSs for the 20 identical pairs, which were calculated from the five data sets. The results revealed that in the “Disordered” sequences, the identical pairs were increased, except for tryptophan. Cysteine, methionine, glycine, and proline, along with the charged residues, had large LCoSs, whereas the other hydrophobic residues, such as I, L, Y, F, and W, had smaller values. These high frequencies of local identical pairs result from the repetitive and low complexity regions included in disordered sequences.[35, 36] The “UniRef” data set also showed increased occurrences of identical pairs. A protein sequence in this data set can consist of a mixture of ordered and disordered regions. In such a sequence, the hydrophobic residues and polar residues will be concentrated in the ordered and disordered regions, respectively, and thereby the local co-occurrences of both types of identical pairs become increased. A similar trend was observed in the “Human” data set, in which about 40% of the sequences are predicted to contain long disordered regions. In contrast, the fractions of disordered regions in bacterial proteins are reportedly low, and the LCoSs of identical pairs in the “E. coli” data set were significantly lower than those of the other three sequence data sets. These results indicated that the suppression of the same residues is strongly related to the foldability of the amino acid sequences.
Figure 3. The LCoSs of the 20 identical pairs in the five data sets, depicted by boxplots. The thick solid lines indicate the median LCoSs. The bottom and top of each box indicate the first and third quartiles, respectively. The whiskers extend to the most extreme data point, which is no more than 1.5 times the interquartile range from the box.
Effects of SSEs
SSEs, such as helices and strands, have periodic patterns of hydrophobic and polar residues. These patterns lead to the periodicity of the side chain orientations in these elements: residues with similar hydrophobicity appear every three or four residues in helix segments and every two residues in strand segments. To evaluate the effect of the SSEs on the residue co-occurrence in the amino acid sequence, the “Ordered” data set was further divided into all-α, all-β, and α/β subsets, as follows. Secondary structures were determined by using the DSSP program. We calculated the α-fraction of a protein as the ratio of the number of residues in helices (DSSP code “H,” “G,” or “I”) to the total number of residues in either helices or sheets (DSSP code “E” or “B”). The proteins with an α-fraction ≥0.8 were classified as all-α, whereas those with an α-fraction ≤0.2 were classified as all-β. Other proteins were classified as α/β. Overall, we obtained 1478 all-α, 977 all-β, and 4913 α/β proteins.
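The α-fraction rule can be sketched as below, assuming that the DSSP assignment of a protein is available as a string with one code per residue. The function names and the example strings are hypothetical, and returning `None` for a protein with no helix or sheet residues is an assumption made here; the text does not state how such cases were handled.

```python
HELIX_CODES = set("HGI")  # DSSP helix codes named in the text
SHEET_CODES = set("EB")   # DSSP sheet codes named in the text

def alpha_fraction(dssp):
    """Helix residues divided by helix-plus-sheet residues; None if neither is present."""
    n_helix = sum(code in HELIX_CODES for code in dssp)
    n_sheet = sum(code in SHEET_CODES for code in dssp)
    total = n_helix + n_sheet
    if total == 0:
        return None  # assumption: such proteins are left unclassified
    return n_helix / total

def classify(dssp):
    """Assign 'all-alpha', 'all-beta', or 'alpha/beta' using the 0.8 and 0.2 thresholds."""
    fraction = alpha_fraction(dssp)
    if fraction is None:
        return None
    if fraction >= 0.8:
        return "all-alpha"
    if fraction <= 0.2:
        return "all-beta"
    return "alpha/beta"

# Invented DSSP strings; 'T' stands for residues outside helices and sheets.
labels = [classify(s) for s in ("HHHHHHHHTTEE", "EEEEETTEEEE", "HHHHTTEEEE")]
```

Loop and turn residues do not enter the denominator, so a protein with long loops can still be classified as all-α or all-β.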
We divided the 400 residue pairs into three groups, according to the rough classifications shown in Figure 1, thus generating 81 HH (green), 121 PP (red), and 198 HP (black) pairs. Figure 4(A–D) shows the average CoSs of the three groups, as functions of residue separation. The results of the entire data set [Fig. 4(A)], which included all 7368 proteins regardless of their SSE contents, showed a strong oscillation of the average CoSs for each group, as a function of separation. This oscillation indicated a strong effect on the entire data set from the α-helix segments, in which residues with similar characteristics appear at separations of 3, 4, and 7. Nevertheless, HH pairs, as well as PP pairs, are essentially avoided in the local sequence, as shown by the LCoS values (dashed lines), which indicate the effects of similar pair suppression in the local proximity. As expected, we observed similar patterns for the α/β proteins, because the data set is the mixture of α helices and β sheets [Fig. 4(D)].
Figure 4. Secondary structure-dependent co-occurrence scores (CoSs). The four graphs illustrate, from top to bottom, the results for (A) the entire data set, (B) the α-protein subset, (C) the β-protein subset, and (D) the α/β protein subset. The CoSs averaged over the residues in the HH, PP, and HP groups are depicted by green, red, and black solid lines, respectively. The average LCoSs of the three groups are shown by gray dotted lines. (E) The Spearman's correlation of CoSs between α and β proteins. (F) The scatter plot of LCoSs between α and β proteins.
The α-protein subset [Fig. 4(B)] and the β-protein subset [Fig. 4(C)] revealed quite different CoS patterns. The results for the α proteins indicated stronger oscillations of CoS values compatible with an α-helix than those for the entire data set, and the favorable interactions of HH and PP pairs in forming helices seemed to overcome their tendency to be suppressed at separations of 3, 4, and 7 [Fig. 4(B)]. On the other hand, since hydrophobic and polar residues are frequently aligned one after another in β-sheets, we expected to observe the co-occurrence of HH and PP pairs at every even residue separation for β proteins, but the results showed that an increased occurrence was only observed at a separation of 2 residues [Fig. 4(C)]. There are two possible explanations for the limited effect of β-sheets. One is the smaller fraction of residues forming SSEs in β proteins than in α proteins: the average fraction of helices in the α-protein subset is 65%, whereas that of strands in the β-protein subset is 48% (P < 0.001, by Wilcoxon test). The other explanation is that a long stretch of alternating patterns of hydrophobic and polar residues is not favored in a protein sequence, because such a region would cause aggregation.[43, 44] In short, the difference between the amino acid sequences of α and β proteins lies at the separations from 2 to 4, where the effects of the different local periodicities are clearly observed.
We then evaluated the effect of SSEs, by calculating the Spearman's correlation coefficient between α and β proteins using the CoSs of the 400 pairs at each separation [Fig. 4(E)]. As a result, the separations were divided into three groups, according to the correlation values: positive correlations at 1, 5, and 6, negative ones at 2, 3, and 4, and no correlation from 7 to 10. The positive correlations coincide with the separations where the HH pairs and the PP pairs were suppressed in both the α and β proteins. The negative correlations confirmed the previous result that α-helices and β-strands had opposing effects on residue co-occurrence at the 2, 3, and 4 separations. If these CoSs were averaged over 1–10 separations to yield LCoSs, then the correlation between α and β proteins would be 0.55, which is higher than the correlations between the CoS values at any separation. As shown in the scatter plot of the LCoSs for α and β proteins [Fig. 4(F)], there was a tendency for the HH and PP pairs to decrease and the HP pairs to increase, regardless of the SSE types of the proteins. Therefore, we concluded that the avoidance of the local proximity of HH and PP pairs is a general feature in the amino acid sequences of soluble, ordered proteins, regardless of the SSE content of each protein.
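The per-separation comparison between α and β proteins uses Spearman's rank correlation, which can be sketched with the standard library alone. This is an illustrative implementation under stated assumptions (tied values receive average ranks, and the input lists are made-up stand-ins for the 400 CoS values of each subset); it is not the software used in the original analysis.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman's correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

# Invented CoS values for four pairs in the alpha and beta subsets at one separation.
alpha_cos = [0.10, -0.05, 0.02, -0.20]
beta_cos = [-0.08, 0.04, -0.01, 0.15]
rho = spearman(alpha_cos, beta_cos)
```

In this toy example the two subsets rank the pairs in exactly opposite order, the kind of pattern that would produce a negative correlation such as those reported at separations 2, 3, and 4.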