Q3 (%) = percentage accuracy of the prediction; the standard deviation (within brackets) is computed assuming a binomial distribution of the assignments. C = correlation coefficient; Pc = probability of correct predictions (for the definitions of the statistical indices, see Appendix B).
Research Article
Role of evolutionary information in predicting the disulfide-bonding state of cysteine in proteins
Article first published online: 1 OCT 1999
DOI: 10.1002/(SICI)1097-0134(19990815)36:3<340::AID-PROT8>3.0.CO;2-D
Copyright © 1999 Wiley-Liss, Inc.
Issue
Proteins: Structure, Function, and Bioinformatics
Volume 36, Issue 3, pages 340–346, 15 August 1999
Additional Information
How to Cite
Fariselli, P., Riccobelli, P. and Casadio, R. (1999), Role of evolutionary information in predicting the disulfide-bonding state of cysteine in proteins. Proteins: Structure, Function, and Bioinformatics, 36: 340–346. doi: 10.1002/(SICI)1097-0134(19990815)36:3<340::AID-PROT8>3.0.CO;2-D
Publication History
- Issue published online: 1 OCT 1999
- Article first published online: 1 OCT 1999
- Manuscript Accepted: 8 APR 1999
- Manuscript Received: 7 JAN 1999
Funded by
- Italian Centro Nazionale per le Ricerche (target project Biotechnology)
- Ministero della Università e della Ricerca Scientifica e Tecnologica (project “Biocatalisi e Bioconversioni”)
Keywords:
- multiple sequence alignment;
- neural networks;
- structure prediction;
- cysteine redox state prediction
Abstract
- Top of page
- Abstract
- INTRODUCTION
- METHODS
- RESULTS AND DISCUSSION
- CONCLUSIONS
- Acknowledgements
- REFERENCES
- Appendix A
- Appendix B
A neural network-based predictor is trained to distinguish the bonding states of cysteine in proteins starting from the residue chain. Training is performed by using 2,452 cysteine-containing segments extracted from 641 nonhomologous proteins of well-resolved three-dimensional structure. After a cross-validation procedure, the prediction efficiency is as high as 72% when the predictor is trained by using protein single sequences. The addition of evolutionary information in the form of multiple sequence alignment and a jury of neural networks increases the prediction efficiency up to 81%. Assessment of the goodness of the prediction with a reliability index indicates that more than 60% of the predictions have an accuracy level greater than 90%. A comparison with a previously described statistical method, tested on the same database, shows that the neural network-based predictor performs with the higher efficiency. Proteins 1999;36:340–346. © 1999 Wiley-Liss, Inc.
INTRODUCTION
The tertiary folds of native proteins are defined by a large number of weak interactions such as hydrogen bonding, hydrophobic interactions, salt bridges, and weakly polar interactions. In addition to these noncovalent forces, certain proteins are also stabilized covalently by disulfide bridges formed by uniquely paired cysteine residues in the folded state. Reduction of disulfide bridges triggers functionally relevant conformational changes.1
The contribution of the disulfide bridge to the thermodynamic stability of proteins has been described as being due to a reduction of the conformational entropy of the unfolded polypeptide chain causing a destabilization of the unfolded state relative to the native state (for review, see Ref. 2), and it can be both experimentally3, 4 and theoretically estimated.5 Several analyses of the characteristics of disulfide bridges in proteins have been performed, including structural and sequence features and classification of connectivity (see Ref. 6 and references therein). The disposition of cysteine residues relative to each other and relative to protein secondary structure is important in the classification of the structure of small disulfide-rich irregular proteins.7
Few studies have addressed so far the important problem of predicting the bonding state of cysteine in a protein chain. The correct prediction of this state can help in predicting ab initio the three-dimensional structure of proteins by adding structural constraints. The relevance of the flanking residues in predicting a cysteine bonding state has been demonstrated by using statistical8 and neural network-based methods9 with a much smaller database than we used in the present study.
Using a cross-validation procedure, we train and test a neural network system on a database of 2,452 cysteine-containing segments (34% of which contain half cystines) to distinguish between bonded and unbonded cysteine. The effect of evolutionary information, not taken into account before, is also investigated.
METHODS
The Database
Two thousand four hundred fifty-two segments containing cysteines (free and disulfide bonded [half cystines]) were taken from the crystallographic data of the Brookhaven Protein Data Bank. Disulfide bond assignment was based on the Define Secondary Structure of Proteins (DSSP) program.10 Nonhomologous proteins (with an identity value <25%) were selected by using the PDB_select_oct_97 algorithm (http://www.embl-heidelberg.de). Segments that lack atomic coordinates for any residue position in the PDB file and segments whose cysteines are interchain disulfide bonded are not included in the database. After this filtering procedure, the total number of examples out of 641 proteins was 2,452, 842 of which were in the disulfide-bonded state and 1,610 of which were in the nondisulfide-bonded state. The PDB codes of the proteins whose cysteine-containing segments are included in the database are listed in Appendix A.
The Neural Network-Based Predictor
Standard feed-forward neural networks are implemented with a back-propagation algorithm as learning procedure.11 The network architecture consists of a perceptron without hidden layers, with two output nodes (discriminating the disulfide and free cysteine propensities, respectively). A cysteine residue, flanked by symmetrical segments of different length (from three to eight residues) is classified as disulfide bridge-forming or free, depending on the relative values of the network output neurons. To compensate for the disproportion between disulfide bond forming and free cysteines during the training phase, learning was accomplished by means of a procedure including a balancing probability factor to reduce the number of back propagation cycles for the most abundant class.12 This was performed to minimize overprediction of free cysteines. Because of the limited number of examples presently available, an early learning stopping procedure is used to train the networks.12
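The balancing step can be illustrated with a minimal sketch. The exact form of the balancing probability factor is given in Ref. 12 and is not reproduced here; the version below simply assumes that each example of the most abundant class is presented in a given training cycle with probability equal to the ratio of the smallest class size to its own class size. The function names `balancing_probabilities` and `epoch_indices` are hypothetical.

```python
import random


def balancing_probabilities(labels):
    """Return, for each class, the probability of presenting one of its examples.

    Assumption: probability = (size of smallest class) / (size of this class),
    so the least abundant class is always presented.
    """
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    smallest = min(counts.values())
    return {label: smallest / count for label, count in counts.items()}


def epoch_indices(labels, rng):
    """Pick the example indices used for back-propagation in one training cycle."""
    probs = balancing_probabilities(labels)
    return [i for i, label in enumerate(labels) if rng.random() < probs[label]]


# Class sizes taken from the database description: 842 bonded, 1,610 free.
labels = [1] * 842 + [0] * 1610
selected = epoch_indices(labels, random.Random(0))
n_bonded = sum(labels[i] for i in selected)
n_free = len(selected) - n_bonded
```

Under this assumption every bonded example is used in each cycle, while roughly half of the free cysteines are skipped, so the two classes contribute a comparable number of weight updates.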
Eight different input codings to the networks (N) are considered. One is based on single-sequence input (NSS), and the remaining seven are based on multiple-sequence profile (NMS). In the former case, only the cysteine flanking residues are taken into consideration by removing the cysteine from the center of the input windows, whose length varies from 7 to 17 residues. This procedure is similar to that previously adopted,9 and it is used to simplify the computation of the network junctions. Indeed, because cysteine is always present in the central position of the segment, it does not carry any information. With single-sequence input, each residue is encoded as a vector of 21 elements, with all elements set to 0 except one, set to 1, whose position in the vector identifies the particular residue type. Twenty elements encode the 20 amino acids, and the last one provides a signal when the input window overlaps either the C or N terminus of the protein.
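The single-sequence coding described above can be sketched as follows. This is an illustration only: the ordering of the residue types within the 21-element vector and the function name `encode_window` are assumptions, and nonstandard residue symbols are not handled.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 residue types; the ordering is an assumption
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
VECTOR_SIZE = 21  # 20 residue types + 1 element signalling a chain terminus


def encode_window(sequence, cys_pos, flank):
    """One-hot encode the `flank` residues on each side of sequence[cys_pos].

    The central cysteine is skipped, so the result has 2 * flank * 21 elements.
    """
    if sequence[cys_pos] != "C":
        raise ValueError("cys_pos must point at a cysteine")
    encoded = []
    for offset in range(-flank, flank + 1):
        if offset == 0:
            continue  # the central cysteine carries no information
        vec = [0] * VECTOR_SIZE
        pos = cys_pos + offset
        if 0 <= pos < len(sequence):
            vec[INDEX[sequence[pos]]] = 1
        else:
            vec[20] = 1  # window position lies beyond the N or C terminus
        encoded.extend(vec)
    return encoded


# Toy example: a 7-residue window (flank = 3) on a short hypothetical chain.
x = encode_window("MKCAW", 2, 3)
```

With a flank of 3 to 8 residues this gives input vectors of 126 to 336 elements, corresponding to the window lengths of 7 to 17 residues.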
When evolutionary information is presented to the network, coding is performed by using a sequence profile for the cysteine-containing segments taken from the HSSP files of the corresponding proteins.13 Also in this case, each residue is encoded by a 21-element vector as in the former case, with the difference that each of the first 20 elements represents the frequency of the corresponding residue type at that position of the sequence alignment. When a multiple-sequence profile is used, the central cysteine is taken into account. This is based on the observation that cysteines can be more or less conserved in the profile.
Alternative input codings based on multiple-sequence profile (Network Multiple Sequence, NMS) are:
NMS+C (Charge): this network adds to the multiple-sequence profile explicit information regarding the charges in the neighboring sequence environment in the form of two more input neurons for each residue in the window. Depending on the amino acid charge, these neurons are set to 1 and 0 for positive, 0 and 1 for negative, and 0 and 0 for noncharged residues.
NMS+H (Hydrophobicity): this network uses as input a matrix based on the hydrophobicity profile. Elements of the matrix are the values contained in the multiple sequence profile derived from the HSSP files13 and multiplied by the hydrophobicity value of each residue. For this, we used different scales, but no significant differences have been observed and the data presented refer to Rose's hydrophobicity scale.14
NMS+WE (conservation Weight and relative Entropy): in this network two more neurons are added for each residue in the input window. One accounts for the conservation weight, and the other represents the relative entropy of each position in the multiple-sequence profile as computed by MaxHom and present in the HSSP files.13
NMS+WEC: this network combines the input described for NMS+C and NMS+WE.
NMS+WEH: this network combines the input described for NMS+H and NMS+WE.
NMS+WECH: this network combines the input described for the three networks NMS+C, NMS+H and NMS+WE.
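The profile-based codings can be illustrated for a single window position with the sketch below. Several details are assumptions rather than statements from the text: the residue ordering, the treatment of histidine as noncharged, the use of the residue of the target sequence to set the charge neurons, and the function name `encode_profile_position`. The hydrophobicity scale is passed in by the caller, so no scale values are assumed here; the conservation weight and relative entropy neurons of NMS+WE would be appended in the same way from the HSSP file.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # ordering is an assumption
POSITIVE = {"K", "R"}  # assumption: histidine is treated as noncharged
NEGATIVE = {"D", "E"}


def encode_profile_position(frequencies, residue=None, scale=None, add_charge=False):
    """Encode one window position from a multiple-sequence profile.

    frequencies: dict mapping residue type to its frequency (0..1) at this
                 position, or None when the position lies beyond a terminus.
    residue:     residue of the target sequence, used for the charge neurons.
    scale:       optional dict of hydrophobicity values (NMS+H style coding).
    add_charge:  append the two charge neurons (NMS+C style coding).
    """
    if frequencies is None:
        vec = [0.0] * 20 + [1.0]  # only the border element is set
    else:
        vec = [frequencies.get(aa, 0.0) for aa in AMINO_ACIDS]
        if scale is not None:
            # frequency multiplied by the hydrophobicity value of the residue type
            vec = [f * scale[aa] for f, aa in zip(vec, AMINO_ACIDS)]
        vec.append(0.0)
    if add_charge:
        if residue in POSITIVE:
            vec.extend([1.0, 0.0])
        elif residue in NEGATIVE:
            vec.extend([0.0, 1.0])
        else:
            vec.extend([0.0, 0.0])
    return vec


# Toy profile column: cysteine in 80% of the aligned sequences, serine in 20%.
plain = encode_profile_position({"C": 0.8, "S": 0.2})
charged = encode_profile_position({"K": 1.0}, residue="K", add_charge=True)
border = encode_profile_position(None)
```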
Statistical evaluation of the predictor efficiency is scored by computing the network accuracy (Q3), the correlation coefficient (C), and the probability of correct predictions (Pc) both for the disulfide bridge-forming and free cysteines (Pc(SS) and Pc(SH), respectively) (for the definition of the statistical indices see Appendix B and also Fariselli et al.12).
The predictor is validated with a cross-validation procedure, which is performed by splitting the whole set of segments of the database into 20 subsets containing an approximately equal number of examples (with the same proportion of disulfide bridge-forming and free cysteines). One subset at a time is removed from the training set and used as testing set. Identity between the segments of the training and testing sets is carefully kept ≤30%. For the evaluation of the statistical indices, the predictions of the 20 different networks are summed up, and the standard deviation for the network accuracy (Q3) is computed, assuming a binomial distribution of the assignments.
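As a worked illustration, the binomial standard deviation of an accuracy Q measured on N examples is sqrt(Q(1 − Q)/N). The sketch below evaluates it for the database size of 2,452 segments and also shows one simple way to build 20 subsets with the same class proportions. The round-robin assignment and the function names are assumptions, and the sketch does not enforce the ≤30% identity constraint between training and testing segments.

```python
import math


def binomial_sd_percent(accuracy_percent, n_examples):
    """Standard deviation (in %) of an accuracy, assuming binomial assignments."""
    q = accuracy_percent / 100.0
    return 100.0 * math.sqrt(q * (1.0 - q) / n_examples)


def stratified_folds(labels, n_folds=20):
    """Distribute example indices over n_folds subsets, class by class.

    Each class is dealt out round-robin, so every subset keeps approximately
    the overall proportion of bonded and free cysteines.
    """
    folds = [[] for _ in range(n_folds)]
    seen = {}
    for i, label in enumerate(labels):
        k = seen.get(label, 0)
        folds[k % n_folds].append(i)
        seen[label] = k + 1
    return folds


sd_single = binomial_sd_percent(71.8, 2452)  # single-sequence accuracy
sd_jury = binomial_sd_percent(81.0, 2452)    # jury accuracy

labels = [1] * 842 + [0] * 1610
folds = stratified_folds(labels)
```

For accuracies of 71.8% and 81.0% the formula gives about 0.9% and 0.8%, respectively, which agrees with the values reported in brackets in the tables.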
RESULTS AND DISCUSSION
The Predictor at Work
In Table I the results obtained with the neural network-based predictor using single sequence as input are listed depending on the window length presented to the network. It is evident that the discriminating capability of the predictor between the two different bonding states of cysteine is as high as 72% with a 13-residue-long window. This is also confirmed by the values of the other statistical indices computed to evaluate the network performance. The results confirm the observation that locally surrounding amino acids greatly influence cysteines in forming disulfide bridges. With the neural network-based predictor the local environment-dependent features are best discovered when cysteine is centered in a 13-residue-long segment.
| Window length | Q3 (%)a | C | Pc (SS) | Pc (SH) |
|---|---|---|---|---|
| 7 | 69.3 (0.9) | .33 | .63 | .72 |
| 9 | 70.4 (0.9) | .38 | .64 | .73 |
| 11 | 71.4 (0.9) | .40 | .66 | .73 |
| 13 | 71.8 (0.9) | .41 | .67 | .74 |
| 15 | 71.1 (0.9) | .40 | .66 | .73 |
| 17 | 70.0 (0.9) | .35 | .60 | .75 |
Different types of network architectures were also tested, including networks with a hidden layer comprising from 2 to 6 neurons. This did not improve the predictor efficiency (data not shown) compared with that obtained with the perceptron without hidden layers. Indeed, the generalization capability of the network decreased progressively as the number of hidden neurons increased, as previously noticed for small training sets such as the one used in this study. An exploration of the effect of the number of examples presented to the perceptron without hidden layers (with a 13-residue-long window) (data not shown) indicates that when the training set is 50% and 75% reduced, the efficiency scored as network accuracy is 68% and 67%, respectively. However, the correlation coefficient values are drastically reduced to 0.34 (for the 50% reduced training set) and to 0.27 (for the 65% reduced training set), indicating that the network tends to perform similarly to a random predictor (C = 0). These results indicate that the size of the database used in the present work (and presently available) can be considered a lower limit for predicting the bonding state of cysteine in proteins with neural networks.
It has been clearly shown that the evolutionary information embodied in a sequence profile can significantly improve protein structure predictions. This has been demonstrated for the prediction of secondary structures,15 of solvent accessible surface,16 and of transmembrane helices in membrane proteins.17, 18 In this work, evolutionary information is provided to the networks to test whether the discriminating capability between disulfide bond-forming and free cysteines is also affected by residue conservation in the local environment.
Using multiple-sequence alignment as input to the networks, the efficiency of the predictor improves by 6% (Table II). This is confirmed also by the increase of the values of the correlation coefficient and of the probability of correct predictions. Moreover, an interesting new result is that the performance of the prediction is independent of the window size in the range from 11 to 17 residues. It can be concluded that with the number of examples in the database, the addition of evolutionary information saturates the network performance already with an 11-residue-long window and that increasing the window size neither deteriorates nor diminishes the generalization capability of the system.
Alternative Input Codings
It is well known (and it has also been confirmed in the previous paragraph) that the form of the input coding plays a key role in the neural network performance.15–18 In this respect we have tried to increase the input information content by using alternative codings. The first takes into account the charges in the cysteine environment, the second adds also the hydrophobicity profile, and the third uses two more input neurons for each residue belonging to the input window, representing the conservation weight and the relative entropy of the residue position in the multiple sequence alignment.13 Furthermore, three other input codings, combining those described above, are also implemented. In Table III the best results obtained with these alternative codings with input window lengths ranging from 17 to 21 residues are listed. It is evident that neither the addition of information on the charge nor that on the hydrophobicity profile of the cysteine environment improves the network performance compared with that obtained by using as input the multiple-sequence profile alone (Table II). However, an improvement of 2% is obtained when the conservation weight and the relative entropy are explicitly taken into account. A graphical depiction of the weights from each amino acid at each window position averaged over the 20 different training sets, including the border condition, the entropy value, and the conservation weight, is shown in Figure 1, both for the disulfide (A) and the nondisulfide nodes (B). In the plots, scaled to a shade of black, dark squares indicate positive weights (strong propensity for the bonded (A) and nonbonded (B) states), and light ones indicate negative weights (weak propensity for the bonded (A) and nonbonded (B) states).
The inspection of the weight values under the conditions of maximal network performance highlights the following: (a) the presence of cysteine residues in the environment of the central cysteine strongly favors the disulfide bond formation, with the exception of positions ± 3. This is in agreement with the fact that metal binding cysteines are typically found in proteins in position i and i ± 3; (b) hydrophilic and/or charged residues in the environment are highly conducive toward disulfide bond formation compared with hydrophobic residues that are poorly conducive; (c) the entropy is lower for the environment of the nonbonded cysteines; and (d) the cysteine in the central position is highly conserved for the disulfide bond-forming cysteines.
| Method | Q3 (%) | C | Pc (SS) | Pc (SH) |
|---|---|---|---|---|
| NMS + C | 78.3 (0.8) | 0.51 | 0.71 | 0.81 |
| NMS + H | 78.3 (0.8) | 0.51 | 0.70 | 0.82 |
| NMS + WE | 79.8 (0.8) | 0.55 | 0.71 | 0.84 |
| NMS + WEC | 79.9 (0.8) | 0.56 | 0.70 | 0.84 |
| NMS + WEH | 80.2 (0.8) | 0.55 | 0.71 | 0.84 |
| NMS + WECH | 80.1 (0.8) | 0.55 | 0.71 | 0.84 |
| JURY | 81.0 (0.8) | 0.57 | 0.72 | 0.85 |
Figure 1. Graphical representation of the values of the weight junctions averaged over the 20 networks used (values are scaled to a shade of black). Junctions are between the input window of 11 residues (shown along the vertical axis) and the network output. A: Propensity for the disulfide-bonding state; B: Propensity for the nonbonding state. Labels of the residues (single letter code) are placed horizontally; 0 represents the border condition; & and # represent the entropy and the conservation weight, respectively.
As previously shown by other authors,15, 19 a jury of networks improves the prediction accuracy. This also holds when a jury of the six different networks previously described is used: the performance increases to 81%.
Our predictive method is also evaluated by measuring the reliability of the prediction. This can be estimated by computing the reliability index that relates the absolute value of the difference between the two output values of the network with the efficiency of the prediction (Q3).15–18 In the case of the jury, the values are averaged among the different concurring networks. In Figure 2, the accuracy of the prediction, together with the percentage of the examples in the database whose prediction is characterized by a certain value of the reliability index, are plotted as a function of the value of the reliability index itself. When the jury of networks based on evolutionary information is used, more than 60% of the cysteine containing segments of the database is predicted with an accuracy of 90%.
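The jury decision and the reliability index can be sketched as follows. The text states only that the index is derived from the absolute difference between the two network outputs and that the values are averaged over the networks of the jury; the scaling of that difference to integer levels 0 to 9, and the function names, are assumptions made for this illustration.

```python
def jury_outputs(network_outputs):
    """Average the (bonded, free) output pairs produced by several networks."""
    n = len(network_outputs)
    bonded = sum(out[0] for out in network_outputs) / n
    free = sum(out[1] for out in network_outputs) / n
    return bonded, free


def predict_with_reliability(bonded, free, levels=10):
    """Return the predicted state and an integer reliability index.

    Assumption: the index is the absolute output difference scaled to
    the integers 0 .. levels - 1.
    """
    state = "bonded" if bonded > free else "free"
    index = min(levels - 1, int(levels * abs(bonded - free)))
    return state, index


# Hypothetical outputs of three networks for the same cysteine.
averaged = jury_outputs([(0.75, 0.25), (0.5, 0.5), (1.0, 0.0)])
state, reliability = predict_with_reliability(*averaged)
```

Predictions can then be grouped by reliability index, and the accuracy and the fraction of the database can be computed for each group, which is the relation plotted in Figure 2.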
Comparison With Other Methods
To compare our predictor with a statistical method previously described,8 we implemented the same method using our database. This was necessary because a direct comparison of our results with those reported in Reference 8 is hampered by the different and smaller database previously used. The statistical method is based on the compilation of two matrices representing the statistical frequencies for the residues along the cysteine flanking regions for both disulfide bond-forming and free cysteines. These frequency matrices are used to compute another matrix (MR), whose elements are taken as the ratio of each position of the frequency matrices of the disulfide bond-forming and free cysteines.8 Provided that the segment whose central cysteine state is to be predicted is not included in the compilation of the frequency matrix, it is possible to evaluate the prediction of the cysteine-bonding state by computing the product of each element in MR associated with each residue in the segment.8 If the product is >1, the cysteine in the segment is predicted to form a disulfide bridge; otherwise, the cysteine in the segment is assigned to the free cysteine class.
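A minimal sketch of this ratio-matrix procedure is given below. It is not the implementation of Reference 8: the pseudocount that avoids division by zero for unobserved residues is an assumption, as are the function names, and the toy segments are invented for illustration. In a real evaluation the segment to be predicted must be excluded from the frequency matrices, as stated above.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def frequency_matrix(segments, pseudocount=1.0):
    """Per-position residue frequencies for a list of equal-length segments."""
    length = len(segments[0])
    counts = [{} for _ in range(length)]
    for segment in segments:
        for pos, aa in enumerate(segment):
            counts[pos][aa] = counts[pos].get(aa, 0.0) + 1.0
    total = len(segments) + len(AMINO_ACIDS) * pseudocount
    return [
        {aa: (column.get(aa, 0.0) + pseudocount) / total for aa in AMINO_ACIDS}
        for column in counts
    ]


def ratio_matrix(bonded_segments, free_segments, pseudocount=1.0):
    """MR: ratio of bonded to free residue frequencies at each position."""
    bonded = frequency_matrix(bonded_segments, pseudocount)
    free = frequency_matrix(free_segments, pseudocount)
    return [{aa: b[aa] / f[aa] for aa in AMINO_ACIDS} for b, f in zip(bonded, free)]


def predict_bonded(segment, mr):
    """Predict the bonded state when the product of the MR elements exceeds 1."""
    product = 1.0
    for pos, aa in enumerate(segment):
        product *= mr[pos][aa]
    return product > 1.0


# Invented three-residue segments, used only to exercise the functions.
mr = ratio_matrix(["KCD", "KCE", "RCD"], ["LCV", "ICV", "LCA"])
```

In this toy setting `predict_bonded("KCD", mr)` is true and `predict_bonded("LCV", mr)` is false, because charged flanking residues occur only in the bonded examples and hydrophobic ones only in the free examples.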
After a cross-validation on the database, the results obtained with the statistical method are compared with those obtained with the neural network approach (Table IV). For comparison, the efficiency obtained by using single sequence and the jury based on multiple-sequence inputs are also shown. It is evident that the neural network-based method scores higher than the statistical one.
| Method | Q3 (%) | C | Pc (SS) | Pc (SH) |
|---|---|---|---|---|
| MR | 68.1 (0.9) | .36 | .50 | .81 |
| NSS | 71.8 (0.9) | .41 | .67 | .78 |
| JURY | 81.0 (0.8) | .57 | .72 | .85 |
Unfortunately, a direct comparison of our predictor with that previously described, also based on neural networks and segment single sequence, is not possible because of a lack of cross-validation in the testing procedure adopted by the authors.9
CONCLUSIONS
In this study we show that a neural network-based predictor is capable of discriminating the disulfide bridge-forming potential of cysteine residues by weighting the effect of the local environment with and without evolutionary information. The results indicate that the jury of neural networks using sequence profiles as input performs well (with an efficiency equal to 81%, which is 16% higher than that obtained with a random predictor) and that the accuracy of the prediction for 60% of the database used is extremely good (90%). The neural network system scores higher than a statistical method implemented and tested with a cross-validation procedure on the same database. The neural network predictor described here can therefore provide a useful tool for protein modeling and protein engineering.
The software is available upon request from the authors.
Acknowledgements
This work was supported partially by a grant for a target project in Biotechnology from the Italian Centro Nazionale per le Ricerche (C.N.R.) and by a grant from Ministero della Università e della Ricerca Scientifica e Tecnologica (MURST) delivered to the project “Biocatalisi e Bioconversioni.”
REFERENCES
- 1
- 2. Disulfide bonds and the stability of globular proteins. Protein Sci 1993; 2:1551–1558.
- 3. Structural thermodynamics: prediction of protein stability and protein binding energy. Arch Biochem Biophys 1993; 303:181–184.
- 4
- 5
- 6. Analysis and classification of disulfide connectivity in proteins. J Mol Biol 1994; 244:448–463.
- 7
- 8. Different sequence environment of cysteines and half cystines in proteins. FEBS Lett 1992; 302:117–120.
- 9. Prediction of the disulfide-bonding state of cysteine in proteins. Protein Eng 1990; 3:667–672.
- 10. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 1983; 22:2577–2637.
- 11
- 12. Predicting secondary structures of membrane proteins with neural networks. Eur Biophys J 1993; 22:41–51.
- 13. Database of homology-derived structures and the structural meaning of sequence alignment. Proteins 1991; 9:56–68.
- 14
- 15. Combining evolutionary information and neural networks to predict protein secondary structure. Proteins 1994; 19:55–72.
- 16. Conservation and prediction of solvent accessibility in protein families. Proteins 1994; 20:216–226.
- 17. Prediction of helical transmembrane segments at 95% accuracy. Protein Sci 1995; 4:521–533.
- 18. Topology prediction for helical transmembrane proteins at 86% accuracy. Protein Sci 1996; 5:1704–1718.
- 19
Appendix A
[Table: PDB codes of the proteins whose cysteine-containing segments are included in the database; image not reproduced.]
Appendix B
In this study the efficiency of the predictors is scored by using three different indices. The network accuracy is defined as:
$$Q_3 = \frac{P}{N} \qquad \text{(1A)}$$
where P is the total number of correct predictions and N is the total number of possible predictions.
Because only two classes (the bonding and the nonbonding state of cysteine) are discriminated, the correlation coefficient C is single valued and defined (for the bonding or nonbonding state) as:
$$C = \frac{p\,n - u\,o}{\sqrt{(p+u)(p+o)(n+u)(n+o)}} \qquad \text{(2A)}$$
where p and n are the total number of correct predictions and that of correctly rejected assignments, respectively, for one state; u and o are the numbers of under- and overpredictions for the same state.
The probability of correct predictions Pc is evaluated as:
$$P_c(s) = \frac{p(s)}{p(s) + o(s)} \qquad \text{(3A)}$$
where for the state s (bonding and non bonding) p(s) and o(s) are the numbers of correct and over predictions, respectively.
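The three indices can be computed from the prediction counts as in the sketch below. It assumes the usual forms implied by the symbol definitions in this appendix: Q3 as the fraction of correct predictions (reported here as a percentage, as in the tables), C as the standard two-class correlation coefficient built from p, n, u and o, and Pc(s) as p(s) divided by p(s) + o(s). The function names and the example counts are hypothetical.

```python
import math


def q3_percent(correct, total):
    """Network accuracy: correct predictions over all predictions, in percent."""
    return 100.0 * correct / total


def correlation(p, n, u, o):
    """Two-class correlation coefficient for one state.

    p: correct predictions, n: correctly rejected assignments,
    u: underpredictions, o: overpredictions (all for the same state).
    """
    denominator = math.sqrt((p + u) * (p + o) * (n + u) * (n + o))
    return (p * n - u * o) / denominator if denominator else 0.0


def prob_correct(p, o):
    """Probability that a prediction of state s is correct."""
    return p / (p + o)


# Hypothetical counts for the bonded state on 100 test segments.
p, n, u, o = 50, 30, 10, 10
q3 = q3_percent(p + n, p + n + u + o)
c = correlation(p, n, u, o)
pc_bonded = prob_correct(p, o)
```

Swapping the roles of the two states (p with n, u with o) leaves C unchanged, which is why the coefficient is single valued for a two-class problem.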
