Prediction of protein domain boundaries from inverse covariances

It has been known even since relatively few structures had been solved that longer protein chains often contain multiple domains, which may fold separately and play the role of reusable functional modules found in many contexts. In many structural biology tasks, in particular structure prediction, it is of great use to be able to identify domains within the structure and analyze these regions separately. However, when using sequence data alone this task has proven exceptionally difficult, with relatively little improvement over the naive method of choosing boundaries based on size distributions of observed domains. The recent significant improvement in contact prediction provides a new source of information for domain prediction. We test several methods for using this information including a kernel smoothing-based approach and methods based on building alpha-carbon models and compare performance with a length-based predictor, a homology search method and four published sequence-based predictors: DOMCUT, DomPRO, DLP-SVM, and SCOOBY-DOmain. We show that the kernel-smoothing method is significantly better than the other ab initio predictors when both single-domain and multidomain targets are considered and is not significantly different to the homology-based method. Considering only multidomain targets the kernel-smoothing method outperforms all of the published methods except DLP-SVM. The kernel smoothing method therefore represents a potentially useful improvement to ab initio domain prediction. Proteins 2013. © 2012 Wiley Periodicals, Inc.

such contacts are typically examined (preliminary testing revealed that the method was not very sensitive to this parameter).
The method uses the contact data to estimate the relative density of contacts which would be broken if a domain cut is predicted to occur at each residue. Thus we maintain a counter for each residue in the sequence and for each predicted contact all residues in the closed interval with the predicted contacting residues as the endpoints have their counters incremented. Local minima in this profile are then likely to indicate domain boundaries assuming that the domains are similar in size and contact themselves more than they contact each other.
The raw profile produced from this is very noisy, (see figure 1 in the main text) and it can be difficult in general to distinguish genuine minima from spurious minima. Smoothing using kernel density estimation (KDE) is a principled, widely-used way to deal with this problem. Intuitively the idea is that the information for a count at a particular residue is not restricted only to that residue but also its neighbours. Technically this is done by changing the discrete counts to the sum of Gaussian distributions centred on each position. The standard deviation of each Gaussian (which is known as the bandwidth parameter in KDE) determines how far along the chain the information is propagated: large bandwidth leads to profiles which are nearly flat while small bandwidth leads to profiles barely less noisy than those in the original discrete case.
Finally the smoothed profile can be used to estimate cut points by looking for local minima in the profile. In this method the value of the derivative of the curve is estimated at each point k as Local minima are then identified as points where the derivative changes sign from negative to positive. All such minima are predicted as cut-points.

Model Reconstruction Methods
The three methods based on previously published domain parsers (WT, Islam, PDP) are identical except for the specific domain parsers used. The input in each case is a list of predicted contacts sorted best-first as described above. Each method then has two parameters: N, the number of predicted contacts to use in model reconstruction and D, the distance threshold which FT-COMAR optimizes each pair of contacts to. Both of these parameters apply only to the domain reconstruction; although the domain parsers do in some cases have their own parameters these were not varied in the experiments.
Predicted contacts are converted to the matrix format required by FT-COMAR and then this method is run with the resulting matrix and the D parameter specified. Neighbour filtering is not used. The resulting model is then parsed for domains with the appropriate domain parsing method.
Where an ensemble prediction was made the FT-COMAR method was run 100 times) and predictions were made for all models. The NDO between each pair of predictions was assessed and the prediction with the highest mean NDO against all other predictions was taken as the ensemble prediction. This method was tested on the WT method only and found not to significantly improve predictions.

Homology based method
The homology-based predictor used HMMER3.0 to search the CATH Gene3D HMM library (v 3.2). A profile of endpoints of alignments to these domains was constructed by summing counts at each sequence location, following which the profile was smoothed by taking average values over 7-residue windows across the whole sequence. This smoothing procedure was repeated five times to remove artefacts created by spurious hits. Using statistics taken from CATH (3.2) we found the mean single-domain length to be 130 with a standard deviation of roughly 70. Sequences shorter than the mean length were automatically predicted to be single domain. Cut points were then edited to remove those <50 residues away from either terminus.

Domain Parser Comparison
Although it is not central to the questions of this study we were interested to determine whether the three parsers tested performed similarly when given the real structures. This is useful because it can provide some notion of whether good performance on real structures is related to good performance on the crude models. The three methods were run on the structures in the Bourne set and the NDO scores for all targets and multidomain targets only were compared using Wilcoxon signed-rank tests for paired data as described in the main Methods section. Raw data are shown in table S1. None of the methods were found to be significantly different in their domain parsing performance.  Figure S1: Parameter testing using real contact data. The kernel smoothing parameter was tested using optimal AMISE smoothing and a variety of functions of the input sequence length (see Methods). For each function and parameter, the mean single-domain NDO score (x-axis) is plotted against the mean multi-domain NDO score (y-axis), calculated over the Bourne set with real contacts as inputs . Points along the Pareto-front are labelled with parameter values.

Consensus Predictions and Distances
Since FT-COMAR uses a stochastic method the model generated in a particular run will differ to some extent (sometimes radically) from models generated on other runs. We investigated using this information to improve the structural domain parser results by generating an ensemble of 100 structures and finding the consensus prediction over the models, the similarity between predictions being assessed using the normalized domain overlap (see below). Additionally we re-estimated contacts by finding the proportion of models in which each residue pair was found below a threshold distance. We then used the closest N of these as predicted contacts for input into the four contact-based predictors and determined whether these produced better results for the KDE and structural methods.
Constructing models from predicted contacts by distance geometry can help to resolve inconsistencies resulting from noise in the predictions. We therefore expected that the mean distances found in the ensemble of models could improve the accuracy of the predicted contacts and potentially improve domain predictions. For low thresholds the AUC values improve overall but in many cases the predictions are made worse. Increasing the threshold to 20Å mitigates this problem without reducing the improvement for other cases substantially. Once the threshold reaches 30Å the AUCs are reduced for the majority of targets.
There are likely to be two reasons that we see this effect: firstly the resolution of inconsistencies as described above and secondly the reintroduction of some of the proximities which would appear as "chains" of contacts, the removal of which is the motivation for the DI method. To test the latter we determined the effect of changing the minimum sequence separation in the AUC calculation from 5 to 10 and observed that we obtained the same result, suggesting that there is some real improvement.
We then assessed whether the improvement was transferred to the accuracy of the domain predictions for the contact-based methods but found that the effect on most methods was to significantly decrease prediction accuracy for both single and multidomain targets. The KDE method did not significantly change although mean accuracy on multidomain targets was marginally higher. Figure S2: Contact prediction improvement by ensemble re-estimation: receiver operator characteristic (ROC) scores for contact predictions are plotted for raw predictions (xaxis) and after re-estimation from 100 FT-COMAR models (y-axis). The y = x line is plotted to assist interpretation. Data are shown for the Bourne set only.