Protein Topology Prediction Algorithms Systematically Investigated in the Yeast Saccharomyces cerevisiae

Membrane proteins perform a variety of functions, all crucially dependent on their orientation in the membrane. However, for most proteins, neither the exact number of transmembrane domains (TMDs) nor the topology has been experimentally determined. As a result, most scientists rely primarily on prediction algorithms to determine topology and TMD assignments. Since these can give contradictory results, single‐algorithm‐based predictions are unreliable. To map the extent of potential misanalysis, the predictions of nine algorithms on the yeast proteome are compared and it is found that they show little agreement when predicting TMD number and termini orientation. To view all predictions in parallel, a webpage called TopologYeast: http://www.weizmann.ac.il/molgen/TopologYeast was created. Each algorithm is compared with experimental data and poor agreement is found. The analysis suggests that more systematic data on protein topology are required to increase the training sets for prediction algorithms and to have accurate knowledge of membrane protein topology.


Introduction
It has been estimated in the literature that nearly 30% of all proteins in eukaryotic cells span membranes. [1] These proteins are essential for diverse functions such as transfer of molecules across membranes, signal transduction, organelle tethering and fusion, and a myriad of enzymatic activities. The structure, function, and localization of such proteins are strongly affected by the number and location of their transmembrane domains (TMDs), as well as the orientation of each of their residues that together comprise the protein topology in the membrane.
Hence, to better understand the function of any transmembrane protein, it is essential to understand its topology.

Experimental Approaches to Study TMD Number and Topology Using Systematic Tools Are Essential for Providing a Whole-Proteome View
A variety of biochemical techniques such as protease protection and cysteine scanning assays have been used for decades to map TMD number and topology in single proteins. [2,3] However, in recent years, several experimental approaches have been published that enable the study of TMD number and protein membrane topology in a systematic manner on tens to hundreds of proteins, or even in entire proteomes. These have been extensively reviewed [2] but just a few examples include the following.

High-Sensitivity Mass Spectrometry (MS) Coupled with Protease Protection Assays
In short, peptides that are facing the cytosol will be sensitive to protease treatment and hence will be lost from MS analysis. This method was recently used to assess yeast mitochondrial protein membrane topology. [4]

Genetic Reporter Tags to Follow the Termini of Membrane Proteins in Living Cells
For example, a glycosylatable green fluorescent protein (gGFP) tag that loses fluorescence when glycosylated in the lumen of the endoplasmic reticulum is a powerful tool developed for tracking the orientation of a terminus of secretory proteins. [5] Another example of the use of reporter tags is the Suc2/His4C chimeric protein. The tag allows glycosylation status to be followed (suggesting luminal orientation in the secretory pathway) since both the His4C and Suc2 peptides have N-linked glycosylation sites. Additionally, the His4C domain encodes a histidinol dehydrogenase activity that converts histidinol into histidine, but only if located in the cytosol. This method has already been used successfully to map the carboxy terminus (C′) orientation of hundreds of secretory proteins. [6] A third tagging approach was recently created based on a split Venus system. [7] A library was created in which all yeast proteins carry one half of the split Venus reporter on their amino terminus (N′). To determine termini orientation, this library was mated with a strain expressing the other half of Venus in the cytosol, and the ability of the two halves to interact was assayed by complementation of the full Venus fluorophore; emission of fluorescence suggests that the N′ is facing the cytosol. [8]
While more methods are starting to become available, such systematic analyses are usually performed on only one terminus of a protein, on subsets of proteins rather than on entire proteomes, and often only in genetically pliable model organisms, such as yeast or bacteria, and not in other experimental systems. Hence, we are still missing robust experimental data on topology at a proteome-wide level for any organism.

Prediction Algorithms Based on Different Approaches and Training Sets Show Little Overlap in Their Output on the Presence and Number of TMDs
While systematic experimental data on protein topology are slowly becoming available, they are still lacking for most organisms. In cases where experimental data are not available, prediction programs are widely used to estimate how many TMDs a protein has (TMD prediction), and the direction in which its N′ and C′ face (termini orientation). Over the years, many different prediction algorithms have been created, each following a different approach and using varying training datasets.
To assay how robust the different TMD prediction algorithms are, we used the protein-coding genome of the yeast Saccharomyces cerevisiae (from here on called yeast) as a test case. The yeast proteome contains 5800 proteins (excluding dubious genes and pseudogenes) and has the most experimental data on protein topology, and hence is a good model for such analyses. We compiled the presence and number of TMDs predicted for all yeast proteins using all TMD prediction algorithms that could be run in batch mode on the entire proteome and that also gave topology predictions: HMMTOP, [9] TMHMM, [1] Topcons, [10] Octopus, [11] Philius, [12] Polyphobius, [13] Scampi, [14] Spoctopus, [15] and MEMSAT-SVM. [16] All nine algorithms are widely used, rely on different methods (such as Hidden Markov Models, Artificial Neural Networks, Support Vector Machines, Bayesian Networks, or combinations of methods; see Table 1 for more details), and are trained on several protein properties, such as chemical traits, distributions, or predicted structures of amino acids (aa). Topcons combines the results from several algorithms (Octopus, Philius, Polyphobius, Scampi, and Spoctopus).
When viewing the results from the nine algorithms, it was immediately apparent that even on the simple question of whether a protein is membrane spanning or not, the prediction programs provide very different answers. Depending on the algorithm, between 20% and 42% of all proteins in the proteome were predicted to be membrane spanning (Table S1, Supporting Information) (Figure 1A). Specifically, different algorithms gave conflicting assignments on whether even one TMD exists for over 2107 proteins (Table S1, Supporting Information). Only 2761 were consistently predicted to be soluble proteins and only 930 were consistently predicted by all nine programs to have at least one TMD, although in many cases, each program predicted a different number of TMDs (Figure 1B). When looking at the number of TMDs predicted for each protein, we found that for only 245 proteins (≈4.2% of the proteome) did all nine programs agree (Figure 1C). This observation raises the concern that a single prediction algorithm is unlikely to correctly predict whether a protein spans the membrane at all and, if so, the number of times it does so.
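The kind of bookkeeping behind these agreement counts can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the function name, the category labels, and the example predictions are all hypothetical, and it simply assumes that each algorithm reports one predicted TMD count per protein.

```python
from collections import Counter

def classify_agreement(tmd_counts):
    """Classify one protein from its per-algorithm predicted TMD counts."""
    has_tmd = [n > 0 for n in tmd_counts]
    if not any(has_tmd):
        return "consistently soluble"
    if not all(has_tmd):
        return "conflicting"
    # Every algorithm predicts at least one TMD; check whether they agree on the number.
    if len(set(tmd_counts)) == 1:
        return "membrane, same TMD number"
    return "membrane, different TMD number"

# Hypothetical predictions from nine algorithms per protein (illustration only).
predictions = {
    "protA": [0, 0, 0, 0, 0, 0, 0, 0, 0],
    "protB": [1, 0, 0, 1, 0, 0, 0, 1, 0],
    "protC": [7, 7, 7, 7, 7, 7, 7, 7, 7],
    "protD": [12, 11, 12, 10, 12, 12, 11, 12, 12],
}

summary = Counter(classify_agreement(counts) for counts in predictions.values())
for category, n in summary.items():
    print(f"{category}: {n}")
```

Applied to a full proteome table, the first two categories would correspond to the "consistently soluble" and "conflicting" groups described above, and the last two together to the proteins that all nine programs place in a membrane.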

A New List of Predicted Tail-Anchored (TA) Proteins Based on All TMD Prediction Algorithms Pooled Together Suggests New Proteins in This Family
Table 1. Names and information on all prediction algorithms used in this manuscript.

Prediction algorithm | Algorithm type | Reference
HMMTOP | Hidden Markov Model | Tusnady and Simon [9]
TMHMM | Hidden Markov Model | Krogh et al. [1]
Topcons | Consensus profile composed of five algorithms (Octopus, Philius, Polyphobius, Scampi, Spoctopus) and then a Hidden Markov Model on the profile | Bernsel et al. [10]
Octopus | Artificial Neural Network + Hidden Markov Model | Viklund and Elofsson [11]
Philius | Dynamic Bayesian Network | Reynolds et al. [12]
Polyphobius | Hidden Markov Model | Kall et al. [13]
Scampi | Hidden Markov Model | Bernsel et al. [14]
Spoctopus | Artificial Neural Network + Hidden Markov Model | Viklund et al. [15]
MEMSAT-SVM | Support Vector Machine | Nugent and Jones [16]

www.advancedsciencenews.com www.bioessays-journal.com

The location of TMDs across the aa sequence of a protein influences its structure and function. TMDs at the extreme N′ or C′ are often harder to predict accurately. For example, a TMD at the very N′ may be confused with a highly hydrophobic signal peptide (SP) that does not form part of the mature protein but rather is only used to direct the protein into the secretory pathway and is later cleaved off. A single TMD at the C′ of a protein is the hallmark of TA proteins that are anchored to all intracellular membranes facing the cytosol, thus enabling their cytosolic domains to carry out crucial functions. Due to their unique topology, distinct targeting and translocation pathways have evolved to cater to them. [17] Despite intense research on TA proteins, the entire repertoire of yeast TA proteins has not been fully identified or verified. This may be because predictions to date have included only a subset of prediction algorithms. Hence, we sought to predict all yeast TA proteins using the nine algorithms mentioned above. We defined a TA protein as one that does not have any N′ targeting motif (SP or mitochondrial targeting sequence [MTS]) [8] and harbors a single TMD at the C′ (no further than 80 aa from the last residue). Based on these criteria and agreement between at least six prediction algorithms (67% agreement), we predicted 78 proteins that could be defined as TA proteins with high confidence (Table S2, Supporting Information).
Out of these, only 31 have previously been assigned as TA proteins [18–35] and 47 are newly predicted TA proteins (19 additional TA proteins were missed by our threshold and would have been captured had we lowered the threshold of high confidence). One hundred and ninety-eight additional proteins were predicted to have a TA by at least one prediction method. Hence, many additional TA proteins may exist and this remains to be experimentally verified.
More generally, our analysis shows that using several prediction algorithms in parallel raises the chance of capturing an entire topological family. Twenty-nine of the newly identified TA proteins are uncharacterized proteins with no known function. Since TA proteins perform important regulatory functions in cells, their predicted membership in this topological family may provide a clue as to the functions of these new members.
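The TA criteria stated in this section (no N′ targeting motif, a single TMD within 80 aa of the last residue, and agreement between at least six of nine algorithms) can be expressed as a short Python sketch. This is an illustration under stated assumptions rather than the code used for Table S2: the function names and input layout are hypothetical, and the 80 aa distance is assumed here to be measured from the end of the TMD, which the text does not specify.

```python
MAX_DISTANCE_FROM_C_TERMINUS = 80  # aa, from the definition in the text
MIN_AGREEING_ALGORITHMS = 6        # of nine algorithms, from the text

def looks_tail_anchored(tmd_segments, protein_length):
    """True if one algorithm predicts a single TMD close to the C-terminus.

    tmd_segments: list of (start, end) residue positions, 1-based.
    Assumption: distance is measured from the TMD end to the last residue.
    """
    if len(tmd_segments) != 1:
        return False
    _start, end = tmd_segments[0]
    return protein_length - end <= MAX_DISTANCE_FROM_C_TERMINUS

def is_high_confidence_ta(per_algorithm_segments, protein_length, has_n_targeting_signal):
    """Apply the three criteria: no SP/MTS, single C-terminal TMD, >= 6 votes."""
    if has_n_targeting_signal:
        return False
    votes = sum(
        looks_tail_anchored(segments, protein_length)
        for segments in per_algorithm_segments
    )
    return votes >= MIN_AGREEING_ALGORITHMS

# Hypothetical protein of 220 aa: six algorithms call one C-terminal TMD, three call none.
example = [[(195, 215)]] * 6 + [[]] * 3
print(is_high_confidence_ta(example, 220, has_n_targeting_signal=False))
```

Lowering `MIN_AGREEING_ALGORITHMS` in such a sketch would correspond to relaxing the high-confidence threshold discussed above.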

Termini Orientation Predictions Show Little Robustness
Uncovering the membrane topology of an integral membrane protein is an essential step in determining its structural and functional properties. While assigning TMD presence and number is already fairly inaccurate (as discussed above), an additional complicated task is to define termini orientation, i.e., whether the termini are facing the cytosol (in) or the lumen of an organelle/the extracellular space (out). To date, most assays for determining termini orientation at the single protein level rely on laborious biochemical techniques and not many systematic assays have been performed in organisms other than yeast. As a result, prediction algorithms are heavily utilized by protein researchers. As in the case of TMD prediction, termini orientation results can vary from one prediction algorithm to the other. Moreover, for some proteins, matters are further complicated as their termini can reside at either side of the membrane while still supporting function. [36] Indeed, looking at how our various prediction programs define an "in" orientation, we find very little agreement in their prediction for either N′ or C′ (Figure 2A,B). How can we know which one of the predictions is right? Or which prediction algorithm should be trusted in most instances? Studies for determining C′ orientation have, to date, only been performed on subsets of proteins and hence a simple systematic comparison was not possible. To collect as many datapoints as possible for our comparisons, we compiled the above-mentioned data from the Suc2/His4C chimeric protein assays. [6] To these datapoints, we added existing data, not originally intended for termini orientation assignments. Specifically, we added protein-protein interaction data from large datasets from three types of protein complementation assays (PCAs) performed with C′ tagging.
In these assays, one half of a reporter protein is attached to the C′ of a query protein and the other half of the reporter protein to the C′ of another. If one protein is cytosolic and one is spanning a membrane, then the only way that the two tagged termini can interact and form the fully complemented assay protein (giving rise to a measurable phenotype) is if the tagged terminus is facing the cytosol. Hence, these data can be used as a proxy for termini orientation even though the assay was not created for this purpose but rather to assay protein-protein interactions.
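The inference rule described in this paragraph can be sketched in Python as follows. The function and argument names are hypothetical, and the sketch assumes that lists of cytosolic and membrane-spanning proteins are already available. Only a positive signal is informative here: the absence of an interaction does not imply that the C′ faces the lumen.

```python
def infer_cytosolic_c_termini(positive_interactions, membrane_proteins, cytosolic_proteins):
    """Infer C' orientation from C'-tagged PCA interaction data.

    positive_interactions: iterable of (protein_a, protein_b) pairs that gave a
    reporter signal when both were tagged at the C'.
    Returns {membrane_protein: "in"} for every membrane protein whose tagged
    terminus complemented the tag of a cytosolic partner.
    """
    inferred = {}
    for a, b in positive_interactions:
        for candidate, partner in ((a, b), (b, a)):
            if candidate in membrane_proteins and partner in cytosolic_proteins:
                inferred[candidate] = "in"
    return inferred

# Hypothetical example data (illustration only).
membrane = {"M1", "M2", "M3", "M4"}
cytosolic = {"C1", "C2"}
pairs = [("M1", "C1"), ("C2", "M2"), ("M3", "M4"), ("C1", "C2")]

print(infer_cytosolic_c_termini(pairs, membrane, cytosolic))
```

Pairs of two membrane proteins, such as the hypothetical M3 and M4, are left unassigned because their tags could in principle meet on either side of the membrane.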
The dihydrofolate reductase (DHFR) PCA reporter confers resistance to the cytostatic drug methotrexate. [37] We found three studies that carried out such analysis and from them deduced the C′ orientation for 430 yeast proteins. [37–39] A second PCA approach is based on the split-ubiquitin system. [40] In this approach, the protein interaction enables cleavage of a ubiquitin, releasing a transcription factor activating transcription of the HIS3 reporter gene and enabling growth in medium lacking histidine. We compiled the data from ten different studies that utilized this system and deduced the C′ orientation for an additional 210 proteins. [40–49] Finally, we compiled the data from three different studies that utilized the split Venus approach on C′ tagged proteins and compiled information on an additional 112 proteins. [50–52] Most proteins were only represented in one of these datasets; thus, there is minimal overlap. In total, we compiled the topology assignments from 16 PCA experiments performed on C′ tags, spanning a total of 15 843 independent interaction datapoints (Table S3, Supporting Information). Comparison of the experimental data from the C′ topology and PCA experiments with the topology predictors enabled us to ask which ones correctly assigned termini orientation in most cases (Figure 2C).
Since the N′ orientation was systematically assayed for yeast TMD proteins (as discussed above), we compared the experimental data [8,53] to the predictions (Figure 2D).
For both termini, it appears that nearly half of all proteins did not show agreement with most algorithms (Figure 2C,D). This makes it difficult to determine a specific algorithm that outperforms the rest.
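A comparison of this kind can be sketched in Python as below. This is a hypothetical illustration, not the analysis behind Figure 2C,D: the data layout (one "in"/"out" call per algorithm per protein) and the function names are assumptions.

```python
from collections import Counter

def agrees_with_most(calls, observed):
    """True if more than half of the algorithms match the experimental orientation."""
    matches = sum(1 for call in calls.values() if call == observed)
    return matches > len(calls) / 2

def per_algorithm_accuracy(predictions, experimental):
    """Fraction of experimentally assayed proteins that each algorithm calls correctly.

    predictions: {protein: {algorithm: "in" or "out"}}
    experimental: {protein: "in" or "out"}
    """
    hits, totals = Counter(), Counter()
    for protein, observed in experimental.items():
        for algorithm, call in predictions.get(protein, {}).items():
            totals[algorithm] += 1
            hits[algorithm] += int(call == observed)
    return {algorithm: hits[algorithm] / totals[algorithm] for algorithm in totals}

# Hypothetical calls from three algorithms for two proteins (illustration only).
predictions = {
    "p1": {"A": "in", "B": "out", "C": "in"},
    "p2": {"A": "out", "B": "out", "C": "in"},
}
experimental = {"p1": "in", "p2": "out"}

print(per_algorithm_accuracy(predictions, experimental))
print({p: agrees_with_most(predictions[p], experimental[p]) for p in experimental})
```

The per-protein view shows how many proteins lack majority support, while the per-algorithm view asks whether any single program stands out.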
Finally, to assess which prediction algorithm is the most accurate at the level of whole-protein topology (orientation of termini and number of TMDs), we compared the outputs of all prediction algorithms to the experimental data on protein topology discussed above. We coupled the C′ and N′ orientation experimental data to determine whether the number of TMDs should be odd or even (if both termini are facing the same direction, TMD number should be even) (Figure 2E) as well as whole-protein topology (Figure 2F). We then compared this to the various algorithms, focusing on 1014 yeast proteins predicted to have a TMD by at least five algorithms. In general, most TMD prediction programs had a similar, unsatisfactory, chance of compatibility with the experimental data, though each one with a different roster of proteins (Figure 2E,F) (Table S2, Supporting Information).
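The parity argument used here (termini on the same side imply an even number of TMDs, termini on opposite sides imply an odd number) can be written as a small Python check. The function names and the "in"/"out" encoding are illustrative assumptions.

```python
def expected_tmd_parity(n_terminus, c_terminus):
    """Parity of the TMD number implied by the two experimental termini orientations."""
    return "even" if n_terminus == c_terminus else "odd"

def prediction_is_compatible(n_terminus, c_terminus, predicted_tmd_count):
    """True if a predicted TMD count has the parity implied by the termini."""
    predicted_parity = "even" if predicted_tmd_count % 2 == 0 else "odd"
    return predicted_parity == expected_tmd_parity(n_terminus, c_terminus)

# Hypothetical example: N' in the cytosol, C' in the lumen, so an odd count is expected.
print(expected_tmd_parity("in", "out"))
print(prediction_is_compatible("in", "out", 7))
print(prediction_is_compatible("in", "out", 6))
```

A parity match is a necessary but not sufficient condition: a prediction of 5 TMDs passes the same check as one of 7.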
Taking the N′ and C′ termini prediction comparisons together, we suggest that either none of the algorithms predicts termini orientation well or the experimental/computational analyses performed to date to define terminus orientation are not satisfactory. In either case, more systematic experimental datasets are urgently needed to assess the superiority of any of the algorithms or to enlarge the training sets of prediction algorithms, enabling them to increase their accuracy and sensitivity of prediction.

Conclusions and Outlook
Our study of nine widely utilized prediction algorithms suggests that, at present, prediction programs still fall short of providing a strong platform for studying protein TMD number and topology. Additionally, our analyses suggest that in cases where experimental proof does not exist, a hybrid computational and experimental prediction might be the best approach for the prediction of TMD number and topology.
Due to the high diversity and disparity between programs, we decided to create an easy platform to simplify the visualization of the available data (predictions as well as experimental) utilized in this manuscript. To this end, we integrated these data into a database, called TopologYeast. [54] Our findings make a strong argument for the importance of further gathering of topology information using systematic approaches for entire genomes. High-throughput approaches and the resulting large datasets are increasingly becoming available, but currently only a small portion of the information that they contain is utilized to understand and explore complicated biological phenomena. We believe that many questions, such as termini orientation and TMD number, can be answered from some of these data. More generally, any additional information will be of wide use to the community of scientists working to understand protein functions.

Supporting Information
Supporting Information is available from the Wiley Online Library or from the author.