Encoding of the peptide amino acid sequence.
Several types and combinations of sequence encodings were used in the neural network training. The first is the conventional sparse encoding, where each amino acid is encoded as a 20-digit binary number (a single 1 and 19 zeros). The second is the Blosum50 encoding, in which each amino acid is encoded as the Blosum50 scores for replacing it with each of the 20 amino acids (Henikoff and Henikoff 1992). Other Blosum encoding schemes were tried, and we found that all encodings with Blosum matrices corresponding to a clustering threshold in the range 30–70% gave comparable performance. In the following we use the Blosum50 matrix when we refer to Blosum sequence encoding. A final encoding scheme is defined in terms of a hidden Markov model; the details of this encoding are described later in section 3.3.

The sparse and Blosum sequence-encoding schemes constitute two different approaches to representing sequence information to the neural network. In the sparse encoding, the network is given very precise information about the sequence that corresponds to a given training example. One can say that the network learns a lot about something very specific: it learns that a specific series of amino acids corresponds to a certain binding affinity value. In the Blosum encoding scheme, on the other hand, the network is given more general and less precise information about a sequence. The Blosum matrix contains prior knowledge about which amino acids are similar and dissimilar to each other. The Blosum encoding for leucine, for instance, has positive encoding values at the input neurons corresponding to isoleucine, methionine, phenylalanine, and valine, and negative encoding values at the input neurons corresponding to, for instance, asparagine and aspartic acid.
This encoding helps the network generalize: when a positive example with a leucine at a given position is presented to the network, the parameters corresponding to the similar and dissimilar amino acids above are also adjusted, such that the network appears to have seen positive examples with isoleucine, methionine, phenylalanine, and valine and negative examples with asparagine and aspartic acid at that specific amino acid position. This ability to generalize the input data is highly beneficial for neural network training when training data are limited. Even in situations where data are not a limiting factor, Blosum sequence encoding may be an important aid in guiding the network training, simply because the Blosum matrix encodes a subtle evolutionary and chemical relationship between the 20 amino acids (Thorne et al. 1996).
Neural network training.
The neural network training is performed in a manner similar to that described by S. Buus, S.L. Lauemøller, P. Worning, C. Kesmir, T. Frimurer, S. Corbet, A. Fomsgaard, J. Hilden, A. Holm, and S. Brunak (in prep.), especially with respect to the transformation applied to the measured binding affinities before the network training and the procedure used for the balanced training of the neural network.
We develop the method with optimal predictive performance in a two-step procedure. In the first round the method is optimized on a subset of 428 of the 528 peptides in the Buus data set, and its performance is evaluated on an independent evaluation set of the remaining 100 peptides. In this manner we minimize the risk of over-fitting. In the second round the method is retrained on the full set of data using the parameter settings obtained in the first round.
The testing and training of the neural networks are performed using fivefold cross-validation, splitting the 428 peptides into five sets of training and test data. The splitting is performed such that all test and training sets have approximately the same distribution of high-, low-, and nonbinding peptides. The training data are used to perform feed-forward and back-propagation, and the test data to define the stopping criterion for the network training, as described by Baldi and Brunak (2001).
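A stratified split of this kind might be sketched as below. This is an illustrative sketch, not the actual partitioning code; the function name, the class labels, and the round-robin assignment are assumptions.

```python
import random

def stratified_fivefold(peptides, affinity_class, rng=random.Random(1)):
    """Split peptides into 5 folds that preserve the mix of affinity classes.

    affinity_class maps each peptide to a label such as 'high', 'low', 'non'.
    Members of each class are shuffled, then dealt round-robin to the folds,
    so every fold receives roughly the same class distribution.
    """
    folds = [[] for _ in range(5)]
    by_class = {}
    for p in peptides:
        by_class.setdefault(affinity_class(p), []).append(p)
    for members in by_class.values():
        rng.shuffle(members)
        for i, p in enumerate(members):
            folds[i % 5].append(p)
    return folds
```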
The performance of the neural networks is measured using the Pearson correlation coefficient on the test set (Press et al. 1989).
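For reference, the Pearson correlation coefficient between predicted and measured values can be computed as follows; this is the standard textbook formula, not code from the study.

```python
def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = sum((a - mx) ** 2 for a in x) ** 0.5  # sqrt of sum of squares
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)
```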
The neural network architecture used is a conventional feed-forward network (Baldi and Brunak 2001) with an input layer of 180 neurons, one hidden layer with 2–10 neurons, and a single-neuron output layer. The 180 neurons in the input layer encode the nine amino acids in the peptide sequence, with each amino acid represented by 20 neurons. The back-propagation procedure was used to update the weights in the network.
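The forward pass of such an architecture might be sketched as below, assuming sigmoid activations (a common choice, not stated in the text); back-propagation and the weight initialization scale are omitted or assumed.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class FeedForward:
    """180 inputs -> one hidden layer (2-10 neurons) -> single output."""
    def __init__(self, n_hidden):
        # Small random weights; the actual initialization is an assumption.
        self.w1 = rng.normal(scale=0.1, size=(180, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.w2 = rng.normal(scale=0.1, size=(n_hidden, 1))
        self.b2 = np.zeros(1)

    def forward(self, x):
        h = sigmoid(x @ self.w1 + self.b1)   # hidden layer
        return sigmoid(h @ self.w2 + self.b2)  # single output in (0, 1)
```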
We transform the measured binding affinities as described by S. Buus, S.L. Lauemøller, P. Worning, C. Kesmir, T. Frimurer, S. Corbet, A. Fomsgaard, J. Hilden, A. Holm, and S. Brunak (in prep.) to place the output values used in the training and testing of the neural networks on a scale between 0 and 1. The transformation is defined as 1 − log(a)/log(50,000), where a is the measured binding affinity in nM. In this transformation, high-binding peptides (measured affinity stronger than 50 nM) are assigned an output value above 0.638; intermediate-binding peptides (affinity stronger than 500 nM), an output value above 0.426; and peptides with an affinity weaker than 500 nM, an output value below 0.426. Peptides with an affinity weaker than 50,000 nM are assigned an output value of 0.0.
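The transformation and its threshold values follow directly from the formula above; the function name and the handling of the 50,000 nM cutoff as a simple clamp are assumptions.

```python
import math

def transform_affinity(a_nM):
    """Map a measured binding affinity a (in nM) to a 0-1 output value."""
    if a_nM >= 50000:
        return 0.0  # affinities weaker than 50,000 nM are assigned 0.0
    return 1.0 - math.log(a_nM) / math.log(50000)

# 50 nM maps to ~0.638 and 500 nM to ~0.426, the thresholds quoted above.
```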
Because the distribution of binding affinities for the peptides in the training and test sets is highly nonuniform, with a great over-representation of low-binding and nonbinding peptides, it is important that the network training is done in a balanced manner. This is done by partitioning the training data into N subsets (bins) such that the ith bin contains peptides with a transformed binding affinity between (i − 1)/N and i/N. In balanced training, data from each bin are presented to the neural network with equal frequency.
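The binning and balanced presentation can be sketched as follows; the function names and the sampling scheme (one example drawn from each nonempty bin per round) are illustrative assumptions, not the authors' implementation.

```python
import random

def assign_bin(t, n_bins):
    """Bin index for a transformed affinity t in [0, 1]:
    bin i (0-based) covers i/N up to (i + 1)/N."""
    return min(int(t * n_bins), n_bins - 1)  # t == 1.0 falls in the last bin

def balanced_batches(data, n_bins, rng=random.Random(0)):
    """Yield batches containing one example per nonempty bin, so that
    each affinity range is presented with equal frequency."""
    bins = [[] for _ in range(n_bins)]
    for t in data:
        bins[assign_bin(t, n_bins)].append(t)
    while True:
        yield [rng.choice(b) for b in bins if b]
```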
For each of the five training and test sets, a series of network trainings was performed, each with a different number of hidden neurons (2, 3, 4, 6, 8, and 10) and a different number of bins (1, 2, 3, 4, and 5) in the balancing of the training. For each series, the single network with the highest test performance was finally selected.