Keywords:

  • Saccharomyces cerevisiae;
  • protein–protein interaction;
  • systematic analysis;
  • protein interaction map;
  • functional prediction

Abstract

  1. Top of page
  2. Abstract
  3. Introduction
  4. Materials and methods
  5. Results
  6. Discussion
  7. Acknowledgements
  8. References

Functional prediction of open reading frames coded in the genome is one of the most important tasks in yeast genomics. Among the large-scale experiments that assign functional classes to proteins, those determining protein–protein interactions are especially important because interacting proteins usually have the same function. Thus, it seems possible to predict the function of a protein when the function of its interacting partner is known. However, in vitro experiments often suffer from artifacts, and a protein can have multiple binding partners with different functions. We developed an objective prediction method that can systematically include information from indirect interactions. Our method can predict the subcellular localization, the cellular role and the biochemical function of yeast proteins with accuracies of 72.7%, 63.6% and 52.7%, respectively. The prediction accuracy rises for proteins with three or more binding partners, and we therefore present open prediction results for 16 uncharacterized proteins with at least five binding partners. Copyright © 2001 John Wiley & Sons, Ltd.


Introduction

Since the sequencing of the Saccharomyces cerevisiae genome was completed, the focus of yeast genomics has shifted to elucidating the function of all gene products (Hieter et al., 1997). A number of systematic experiments have been performed for this purpose (Martzen et al., 1999). However, such experiments do not directly determine a specific function of each gene product; e.g. a systematic experiment may determine that a gene is essential for the survival of the cell, but this is only a clue for further experimental study. In this sense, systematic experiments in functional genomics should be considered a means of screening for interesting genes. These experiments can also include a certain proportion of errors. Therefore, computational analyses that can complement the results of systematic experiments are of great value. For example, a computer programme that can detect known sequence features of protein sorting signals will strengthen the experimental observation (Nakai and Horton, 1999). In addition to standard homology searching, several computational methods for functional prediction of proteins have been proposed. They include the domain fusion method (Marcotte et al., 1999a) and the phylogenetic profile method (Pellegrini et al., 1999), both of which are based on comparative genomics. Although such methods give us interesting clues, their reliability is not satisfactorily high. Additional methods that can refine the experimental results with higher accuracy would be desirable. In this paper, we introduce a method for predicting protein function from experimental protein–protein interaction data.

Among a variety of systematic experiments in the post-sequencing era, protein–protein interaction experiments (e.g. two-hybrid analysis and co-immunoprecipitation) are of great interest because interacting proteins are likely to collaborate on a common purpose. Thus, as in the case of sequence alignment, it is possible to deduce the function of a protein when the function of its binding partner is known. Large amounts of interaction data have been produced using the two-hybrid technique on bacteriophage T7 (Bartel et al., 1996), on Drosophila cell cycle regulators (Finley et al., 1994) and on yeast (Micheline et al., 1997; Uetz et al., 2000; Ito et al., 2000). In an attempt to use such data for comprehensive prediction of protein function in silico, Eisenberg's group used the data from 500 interactions in their combined algorithm (Marcotte et al., 1999b). However, the predictability of protein function from protein–protein interaction data has not been objectively assessed. Since two-hybrid experiments inevitably include false positives to some extent, an objective assessment is necessary when protein–protein interaction data are used for functional prediction (Tirode et al., 1997; Brachmann et al., 1997; El Housni et al., 1998). It is also necessary to formulate how the prediction is made when a protein has multiple binding partners with different functions. Finally, the possibility of including information from indirect interactions in the prediction should be explored. Here, we report how accurately we can predict a protein's function from its interaction data.

Materials and methods

Protein function data and protein–protein interaction data

Three categories of yeast protein ‘function’, according to the classification used in the Yeast Proteome Database (YPD), were used: the ‘subcellular localization’, the ‘cellular role’ and the ‘biochemical function’ (Costanzo et al., 2000). ‘Subcellular localization’ refers to a specific site in a cell where a protein is located and consists of 22 members, e.g. ‘Golgi’ and ‘cytoplasmic’. The ‘cellular role’ denotes a major biological process of a protein, such as ‘DNA repair’ or ‘signal transduction’, and consists of 41 members. The ‘biochemical function’ describes a structural, regulatory, or enzymatic function, such as ‘structural protein’, ‘transcription factor’, ‘protein kinase’ or ‘DNA-binding protein’ and consists of 57 members. It can be said that the ‘cellular role’ indicates the function of a protein as a unit of a group, while the ‘biochemical function’ indicates its individual function. The January 2000 version of YPD included 2817 (46.0%), 3476 (56.6%) and 3056 (49.8%) proteins with labelled ‘subcellular localization’, ‘cellular role’ and ‘biochemical function’, respectively.

A total of 2112 distinct physical interactions were obtained from the websites of MIPS (http://www.mips.biochem.mpg.de) (Mewes et al., 2000) and CuraGen (http://portal.curagen.com) (Uetz et al., 2000), and from the supplementary material of the paper by Ito et al. (2000; http://www.pnas.org). MIPS, CuraGen and Ito et al. contributed 1172, 890 and 139 interactions, respectively; these counts sum to more than 2112 because some interactions are reported by more than one source.
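The pooling of several sources into one set of distinct interactions can be sketched as follows. This is a minimal illustration, not the procedure actually used for the data above: the function name `merge_interactions`, the toy source lists and the protein names are all hypothetical, and it assumes that an interaction is an unordered pair, so that (A, B) and (B, A) count once.

```python
def merge_interactions(*sources):
    """Pool interaction lists, keeping each unordered protein pair once."""
    distinct = set()
    for pairs in sources:
        for protein_a, protein_b in pairs:
            # Sort so that (A, B) and (B, A) collapse to the same key.
            distinct.add(tuple(sorted((protein_a, protein_b))))
    return distinct


# Toy stand-ins for three data sources (hypothetical protein names).
source_1 = [("P1", "P2"), ("P2", "P3")]
source_2 = [("P2", "P1"), ("P3", "P4")]
source_3 = [("P3", "P4")]

merged = merge_interactions(source_1, source_2, source_3)
per_source_total = len(source_1) + len(source_2) + len(source_3)
print(len(merged), per_source_total)
```

In the toy data the three sources list five interactions in total but only three distinct pairs, which mirrors why the per-source counts can exceed the number of distinct interactions.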

Method for prediction of protein function

In the preprocessing step, the physical interaction data (Figure 1A) were integrated into a protein interaction map (Figure 1B), where each node represents a protein and each edge represents the interaction between two proteins. Next, the function of each protein in the map (black circle in Figure 1C) is predicted, based on the functions of its ‘n-neighbouring proteins’, which are defined as the set of proteins reachable via at most n physical interactions (n is an integer parameter). For example, in Figure 1C, all proteins enclosed by the inside dashed circle are ‘1-neighbouring proteins’, and those enclosed by the outside circle are ‘2-neighbouring proteins’. The protein of interest is assigned the function with the highest χ² value among the functions of all n-neighbouring proteins. For each member of the function category, the χ² value is calculated using the following formula:

  \[ \chi^2_i = \frac{(n_i - e_i)^2}{e_i} \]

where i denotes a protein function, e.g. ‘Golgi’, ‘DNA repair’ or ‘transcription factor’, e_i denotes the number of n-neighbouring proteins expected to have function i given the distribution of functions over the whole map, and n_i denotes the observed number of n-neighbouring proteins with function i. The function of a query protein is then predicted to be the function i with the maximum χ² value. When multiple functions share the largest χ² value, all of them are assigned. The optimal n value is determined by a so-called self-consistency test, in which the predicted functions of all proteins in the map are compared with their annotated functions for each n.
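The procedure described above can be sketched in a few lines of Python. This is a hedged illustration rather than the original implementation: the names `n_neighbours`, `predict` and `self_consistency`, the toy map and its labels are hypothetical. It also makes several assumptions that the text does not spell out: each protein carries at most one label per category, the expected count is the neighbourhood size multiplied by the label's frequency among all annotated proteins on the map, unannotated neighbours are ignored, and a prediction counts as correct when the annotated function is among the (possibly tied) predicted functions.

```python
import math
from collections import Counter, deque


def n_neighbours(adjacency, query, n):
    """Proteins reachable from `query` via at most n interactions."""
    seen, found = {query}, set()
    frontier = deque([(query, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == n:
            continue  # do not expand beyond distance n
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                found.add(nxt)
                frontier.append((nxt, dist + 1))
    return found


def predict(adjacency, annotation, query, n):
    """Return the set of functions with the maximum chi-square value."""
    nodes = set(adjacency).union(*adjacency.values())
    background = Counter(annotation[p] for p in nodes if p in annotation)
    total = sum(background.values())
    neighbours = [p for p in n_neighbours(adjacency, query, n) if p in annotation]
    if not neighbours:
        return set()
    observed = Counter(annotation[p] for p in neighbours)
    scores = {}
    for function, n_i in observed.items():  # candidates: functions seen among neighbours
        e_i = len(neighbours) * background[function] / total
        scores[function] = (n_i - e_i) ** 2 / e_i
    best = max(scores.values())
    return {f for f, s in scores.items() if math.isclose(s, best)}


def self_consistency(adjacency, annotation, n):
    """Fraction of annotated proteins whose label is among the predictions."""
    hits = tested = 0
    for protein in adjacency:
        if protein not in annotation:
            continue
        predicted = predict(adjacency, annotation, protein, n)
        if predicted:
            tested += 1
            hits += annotation[protein] in predicted
    return hits / tested if tested else 0.0


# Toy map (hypothetical): Q is the unannotated query protein.
toy_map = {"Q": {"A", "B", "C"}, "A": {"Q"}, "B": {"Q"},
           "C": {"Q", "D"}, "D": {"C", "E"}, "E": {"D"}}
toy_labels = {"A": "nuclear", "B": "nuclear", "C": "Golgi",
              "D": "Golgi", "E": "cytoplasmic"}
```

On the toy map, `predict(toy_map, toy_labels, "Q", 1)` yields only ‘nuclear’, whereas with n = 2 ‘nuclear’ and ‘Golgi’ tie and both are returned, which illustrates the tie rule. Calling `self_consistency` for each n and keeping the n with the highest value corresponds to the parameter selection described above.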

Figure 1. Overview of the prediction method. White circles represent proteins and the black circle represents a query protein for which function is predicted. (A) Physical interaction data deposited in the public databases. (B) Construction of the protein interaction map by integrating all physical interaction data. (C) Assignment of function to a query protein. This is done based on the functions of neighbouring proteins on the map. For further explanation, see text

For comparison, prediction results using randomly assigned functional categories were also calculated. In each trial, the functions were reassigned among the proteins at random while conserving the size distribution of the functional categories. The prediction results of 100 trials were then averaged.
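One way to realize such a randomized control is to permute the labels among the annotated proteins, which keeps every category at its original size. The sketch below assumes this permutation reading of the procedure; the names `shuffled_annotation` and `dummy_accuracy`, the `accuracy_of` callback and the fixed seed are hypothetical choices made for illustration. A plain permutation can leave an individual protein with its original label by chance.

```python
import random
from collections import Counter


def shuffled_annotation(annotation, rng):
    """Permute labels among proteins; category sizes are unchanged."""
    proteins = list(annotation)
    labels = [annotation[p] for p in proteins]
    rng.shuffle(labels)
    return dict(zip(proteins, labels))


def dummy_accuracy(accuracy_of, annotation, trials=100, seed=0):
    """Average `accuracy_of(randomized annotation)` over several trials."""
    rng = random.Random(seed)  # fixed seed so the control is reproducible
    results = [accuracy_of(shuffled_annotation(annotation, rng))
               for _ in range(trials)]
    return sum(results) / trials


# Toy annotation (hypothetical protein names and labels).
toy_labels = {"P1": "nuclear", "P2": "nuclear", "P3": "Golgi", "P4": "cytoplasmic"}
randomized = shuffled_annotation(toy_labels, random.Random(1))
same_sizes = Counter(randomized.values()) == Counter(toy_labels.values())
```

Here `accuracy_of` stands for whatever routine scores a prediction run against a given annotation, for example a self-consistency test evaluated at one n value.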

Results

Inspection of data

It is natural to expect that two interacting proteins are localized in the same subcellular compartment. Therefore, we first tabulated the localization sites of interacting proteins (Table 1). For 1220 interactions, the subcellular localization sites of both proteins were known, and for 843 (69.1%) of these interactions, both proteins share the same localization site. Many of the other interactions may be false positives, although some are expected to be genuine, e.g. cytosolic proteins interacting with the membrane proteins of each compartment. Each protein has 2.4 binding partners on average (data not shown). For the other definitions of protein function, the two proteins share the same cellular role in 951 of 1489 interactions (63.9%) and the same biochemical function in 528 of 1145 interactions (46.1%).
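The tabulation behind these percentages amounts to counting, among interactions whose two proteins are both annotated, those in which the annotations agree. A minimal sketch follows; the names `count_shared` and `percent` and the toy data are hypothetical, and one label per protein is assumed. The last three lines simply recompute the percentages quoted above from the reported counts.

```python
def count_shared(pairs, annotation):
    """Return (interactions with both labels known, those sharing a label)."""
    known = [(a, b) for a, b in pairs if a in annotation and b in annotation]
    shared = sum(1 for a, b in known if annotation[a] == annotation[b])
    return len(known), shared


def percent(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Toy example (hypothetical): P4 has no annotation, so P3-P4 is skipped.
toy_pairs = [("P1", "P2"), ("P2", "P3"), ("P3", "P4")]
toy_localization = {"P1": "nuclear", "P2": "nuclear", "P3": "cytoplasmic"}
known, shared = count_shared(toy_pairs, toy_localization)

# Percentages recomputed from the counts reported in the text.
localization_pct = percent(843, 1220)
cellular_role_pct = percent(951, 1489)
biochemical_pct = percent(528, 1145)
```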

Table 1. Number of physical interactions for each combination of subcellular localization sites
Subcellular localizationAbb.bncwspcpcserevexgolymcmtncovpopmsvum
Bud neckbn7003270000001600410
Cell wallcw 00100000000100000
Centrosome/spindle pole bodysp  1910710001005200200
Cytoplasmiccp   792980036071568627411
Cytoskeletalcs    62000210226111922
Endoplasmic reticulumer     2210450327200303
Endosome/endosomal vesiclesev      003200340001
Extracellularex       00000200000
Golgigo        26700690033
Lysosome/vacuolely         901390607
Microsomal fractionmc          00000000
Mitochondrialmt           322500302
Nuclearnc            5291123122
Other vesicles of the secretory/endocytic pathwaysov             190423
Peroxisomepo              11001
Plasma membranepm               1944
Secretory vesiclessv                21
Unspecified membraneum                 8

Self-consistency test

Figure 2A–C shows the prediction accuracy, assessed by the self-consistency test, for the three kinds of ‘protein function’ at various n values. Dummy prediction results using the randomized function assignments are also shown for reference. Subcellular localization was the easiest to predict; its maximum accuracy was 72.7% with 1-neighbouring proteins, and the accuracy was always over 50%, regardless of the n value. Cellular role was the second easiest; its maximum accuracy was 63.6% with 1-neighbouring proteins. The prediction accuracy of biochemical function, however, was rather low, with a maximum of 52.7% with 2-neighbouring proteins.

Figure 2. Result of self-consistency test. The horizontal axis represents the distance from a query protein (the n value) and the vertical axis represents the percentage prediction accuracy (data shown in filled diamonds). For reference, prediction results with randomly-assigned functions conserving the size distribution are also shown (in filled circles). (A) prediction of subcellular localization. (B) prediction of cellular role. (C) prediction of biochemical function

To see whether the prediction accuracy becomes higher for proteins with more binding partners, proteins were divided into two groups: 1245 proteins with one or two partners and 983 proteins with three or more partners. For the former group, the prediction accuracies of subcellular localization, cellular role and biochemical function were 70.3% (n=1), 58.8% (n=1) and 47.7% (n=2), respectively. For the latter group, the accuracies were 80.8% (n=2), 84.3% (n=1) and 65.1% (n=2), respectively. Figure 3 shows both the distribution of the number of binding partners and the dependence of prediction accuracy on this number for the three kinds of functional prediction. For small numbers of binding partners, the accuracy grows slightly as the number of partners increases.

Figure 3. Frequency of the number of binding partners and the dependency of prediction accuracy on that number. The horizontal axis represents the number of binding partners. The bar chart shows the distribution of the number of binding partners, with the right vertical axis showing their frequency. The line graphs show the prediction accuracy (%) with the left vertical axis

Confirmation of the validity of the methodology

To test whether the use of heterogeneous data sources and experimental techniques affects the prediction accuracy, we examined the prediction accuracy for the cellular role as a typical example, using the major sources or techniques individually (Table 2). The MIPS data, the largest set, give better accuracy than the other two sources. In contrast, among the experimental techniques, the two-hybrid method, which contributes the largest share of the data, gives the lowest prediction accuracy. Table 2 also gives the prediction accuracy for additional cases: cases in which the proteins are redundantly referenced in more than two data sources; cases in which the binding partners are reciprocally detected; and cases in which the neighbouring proteins are interconnected. In all cases, the prediction accuracy improves. For redundant data, the prediction accuracies were 75.0%, 78.2% and 62.4% for subcellular localization, cellular role and biochemical function, respectively. For interconnected neighbours, the corresponding accuracies were 84.4%, 86.6% and 69.9%. Therefore, it seems clear that redundantly referenced proteins and proteins whose neighbours are interconnected are predicted more accurately.

Table 2. Effect of using heterogeneous data sources and experimental techniques
Data                        | #Pairs | %Pairs | n=1  | n=2  | n=3   | n=4   | n=5
Ito                         | 137    | 6.5    | 51.3 | 42.9 | 48.1  | 53.8  | 57.1
CuraGen                     | 890    | 42.1   | 43.2 | 39.4 | 33.6  | 27.0  | 22.2
MIPS                        | 1172   | 55.5   | 75.4 | 73.7 | 69.6  | 60.7  | 56.9
>Two sources                | 83     | 3.9    | 78.2 | 77.6 | 71.7  | 58.8  | 50.0
Affinity chromatography     | 88     | 4.2    | 70.8 | 83.9 | 100.0 | 100.0 |
Co-purification             | 175    | 8.3    | 86.8 | 95.0 | 100.0 |       |
Co-immunoprecipitation      | 386    | 18.3   | 85.9 | 88.6 | 90.0  | 88.7  | 88.9
Two-hybrid                  | 1541   | 73.0   | 64.8 | 62.7 | 57.9  | 50.3  | 43.4
>Two methods                | 210    | 9.9    | 82.4 | 80.3 | 77.8  | 73.7  | 60.0
Reciprocal interaction      | 1064   | 50.4   | 75.8 | 74.7 | 70.3  | 62.7  | 60.3
Interconnected interaction  | 1193   | 56.5   | 86.6 | 85.2 | 79.9  | 72.2  | 60.5
All                         | 2112   | 100.0  | 63.6 | 59.5 | 55.8  | 48.3  | 42.4

Note: #Pairs and %Pairs represent the number of pairs and percentage of pairs, respectively, for each subclass of major data sources and techniques. The subsequent columns represent the prediction accuracy (%) for each n value.

Our method can be expected to have difficulty predicting functional categories with extremely few members. It is also important to know the accuracy expected by chance for rather large categories. To test these cases, we selected the two largest and the two smallest categories from both subcellular localization and cellular role (Table 3). It can be seen that our method is not effective for predicting small categories. However, the prediction accuracy for larger categories is fairly good compared with the dummy results, except for the largest category, nuclear proteins, which accounts for 30% of all proteins. In this case, the dummy prediction rates are unusually good for large n values, although the real prediction rates still outperform them significantly.

Table 3. Size effects of function categories on prediction accuracy
Function                 | Category                    | Frequency (%) | n=1         | n=2         | n=3         | n=4         | n=5
Subcellular localization | Nuclear                     | 30.2          | 93.2 (50.1) | 91.4 (61.1) | 91.3 (67.3) | 94.0 (80.9) | 94.8 (86.5)
Subcellular localization | Cytoplasmic                 | 19.7          | 54.3 (27.6) | 56.2 (30.3) | 43.9 (29.9) | 38.1 (19.6) | 25.6 (12.8)
Subcellular localization | Secretory vesicles          | 0.3           | 60.0 (0.0)  | 0.0 (0.0)   | 0.0 (0.0)   | 0.0 (0.0)   | 0.0 (0.0)
Subcellular localization | Endosome/endosomal vesicles | 0.3           | 0.0 (0.0)   | 0.0 (0.0)   | 0.0 (0.0)   | 0.0 (0.0)   | 0.0 (0.0)
Cellular role            | Small molecule transport    | 7.5           | 20.1 (3.1)  | 27.6 (0.0)  | 20.0 (0.0)  | 4.0 (0.0)   | 4.2 (0.0)
Cellular role            | Protein synthesis           | 6.8           | 52.2 (7.3)  | 52.3 (2.9)  | 47.2 (0.0)  | 23.5 (0.0)  | 17.6 (0.0)
Cellular role            | Phosphate metabolism        | 0.4           | 33.3 (0.0)  | 0.0 (0.0)   | 0.0 (0.0)   | 0.0 (0.0)   | 0.0 (0.0)
Cellular role            | Mitochondrial transcription | 0.4           | 0.0 (0.0)   | 0.0 (0.0)   | 0.0 (0.0)   | 0.0 (0.0)   | 0.0 (0.0)

Note: The third column represents the frequency of proteins belonging to each class and the subsequent columns represent the percentage prediction accuracy for each n value (the dummy prediction accuracy obtained from randomized function assignments is shown in parentheses). For more details, see text.

Discussion

In this paper, we have brought together several sets of experimentally determined protein–protein interaction data and assessed the accuracy with which protein function can be predicted from them. Because functional categories differ in size, a prediction based on raw counts would be biased towards large categories; to avoid this bias, the χ² value was calculated for each function. Moreover, the possibility of including indirectly interacting proteins in the prediction was systematically explored. Three kinds of definitions of ‘protein function’ were tested. The simplest, the subcellular localization site, could be predicted with the highest reliability, 72.7%. This value seems reasonable because proteins at different localization sites can sometimes interact with each other and because experimental errors cannot be neglected. The cellular role of proteins could be predicted with an accuracy of 63.6%. This definition of protein function may be the most useful for further experimental study, and it is understandable that the ‘guilt-by-association’ principle is effective for deducing the cellular role. By contrast, it is quite natural that the prediction accuracy of biochemical function was as low as 52.7%, because there is no reason why two interacting proteins should have the same biochemical function. In summary, protein–protein interaction data can be a valuable resource for deducing the function of uncharacterized proteins, despite the imperfection of these data.

To test the validity of our approach more precisely, we performed several complementary tests. First, our prediction is based on data obtained from various sources and experimental techniques. In principle, using more data is favourable in the sense that a query protein is more likely to have an interaction stored in the database. However, it is possible that the use of an unfavourable data source lowers the overall prediction accuracy. We checked this concern in Table 2. Among the data sources, the MIPS set, the largest one, is the best, whereas among the experimental techniques, the two-hybrid method, the largest one, is the worst. Both Ito's data and the CuraGen data are based on the two-hybrid method, so these results are consistent with each other, but the data from the two-hybrid method make up a large part of our data and cannot be omitted. It should also be noted that methods such as affinity chromatography and co-purification tend to detect multimeric proteins, whose function is apparently easier to deduce. Our observation that prediction accuracy improves when neighbouring proteins are interconnected can be interpreted to mean that such cases mostly occur for multimeric proteins. Thus, the data from the two-hybrid experiments are invaluable for our prediction, although combination with other methods would be desirable.

Second, the dependency of prediction accuracy on the number of binding partners is presented in Figure 3. The accuracy improves slightly as the number of binding partners increases, but for larger numbers this tendency is hard to detect, probably because of the small number of available interaction data. Since the function of proteins with three or more partners can be predicted more accurately than the function of those with one or two partners, we conclude that prediction becomes more reliable as the number of binding partners increases. Therefore, we can expect that the future accumulation of more interaction data will enable us to predict protein function more precisely. In addition, we will soon be able to apply our method to the proteins of other organisms, because several projects to determine protein–protein interaction data of various organisms are ongoing (Walhout et al., 2000; McCraith et al., 2000).

Third, in our prediction method, we used the χ² value instead of the raw number of neighbouring members, in order to account for the size difference between target functions. Was it really effective? We tested the prediction accuracy for the two largest and two smallest categories of both subcellular localization and cellular role (Table 3). These values were compared with the dummy accuracy obtained from data of the same size with randomly assigned functions. Although χ² values are not effective for correctly predicting very small categories, they are reasonably effective for larger categories. However, for an extraordinarily large category, nuclear proteins, the dummy accuracy shows apparently good values if we consider distantly related neighbours. Figure 2A also shows that the prediction accuracy for subcellular localization asymptotically approaches a rather high limit (about 55%) when the n value becomes very large. Nevertheless, we believe that the total accuracy of our method is not much affected by this ‘default’ value because the prediction accuracy is highest when the n value is 1.

For the convenience of experimental researchers, some of our open prediction results are summarized in Table 4. From the 409 proteins in our interaction map that had no annotations for ‘subcellular localization’, ‘cellular role’ and ‘biochemical function’, 16 were selected because they had at least five binding partners. The prediction result of subcellular localization by PSORT (Nakai and Horton, 1999) is also shown. Although PSORT is applicable to every sequence independently of experimental results, its prediction accuracy is estimated to be about 60%. In addition, it should be noted that PSORT does not predict peripheral membrane proteins unless they are GPI-anchored to the membrane. For the proteins in Table 4, the prediction results of the current method seem more reliable. Moreover, it is interesting that the three types of functional prediction, performed independently, sometimes give consistent results. For example, YEL015W is predicted to localize at the nucleus, its cellular role being RNA processing/modification, and its biochemical function being a hydrolase or an RNA-binding protein. Taken together, these results seem to suggest a coherent picture of its function. Again, we can conclude that, although the method proposed in this paper has the limitation that it is only applicable to proteins with experimentally detected interactions, it is promising because protein–protein interaction data are accumulating rapidly.

Table 4. Open functional prediction for 16 uncharacterized yeast proteins
ORF name | #Partner | Subcellular localization    | Cellular role                                | Biochemical function                           | PSORT prediction
YBR270C  | 5        | Nuclear                     | Protein degradation                          | Guanine nucleotide exchange factor             | Mitochondrial
YDL012C  | 7        | Unspecified membrane        | Other metabolism                             | Transporter                                    | Cytoplasmic or nuclear
YDL071C  | 5        | Cytoplasmic                 | Cell cycle control                           | Transferase                                    | Cytoplasmic
YEL015W  | 6        | Nuclear                     | RNA processing/modification                  | Hydrolase or RNA-binding protein               | Nuclear
YGL161C  | 7        | Lysosome/vacuole or Golgi   | Membrane fusion                              | Docking protein or complex assembly protein    | Endoplasmic reticulum
YGR058W  | 5        | Cytoplasmic                 | Signal transduction or cell stress           | Protein kinase or transferase                  | Nuclear
YHR105W  | 5        | Unspecified membrane        | Protein modification                         | Oxidoreductase                                 | Nuclear
YIL105C  | 7        | Cytoplasmic or nuclear      | Recombination or differentiation             | Oxidoreductase or DNA-binding protein          | Nuclear
YLR368W  | 9        | Mitochondrial               | Cell cycle control                           | Protein conjugation factor or protease         | Mitochondrial
YLR423C  | 5        | Lysosome/vacuole or Golgi   | Membrane fusion or mitosis                   | Docking protein or GTP-binding protein/GTPase  | Cytoplasmic
YLR456W  | 5        | Nuclear                     | RNA processing/modification or RNA splicing  | Spliceosomal subunit or translation factor     | Nuclear
YNL047C  | 5        | Cytoplasmic                 | Cell polarity or mating response             | Protein phosphatase                            | Nuclear
YNL091W  | 6        | Nuclear                     | RNA splicing or RNA processing/modification  | Spliceosomal subunit or RNA-binding protein    | Nuclear
YNR053C  | 6        | Nuclear                     | RNA splicing or RNA processing/modification  | Spliceosomal subunit or RNA-binding protein    | Nuclear
YOL082W  | 6        | Nuclear or lysosome/vacuole | DNA repair or meiosis                        | Transcription factor or protease               | Nuclear
YPR105C  | 8        | Endoplasmic reticulum       | Vesicular transport                          | Receptor or docking protein                    | Nuclear

Note: The second column shows the number of binding partner proteins.

Acknowledgements

We thank Tetsushi Yada for many useful comments, and Proteome Inc. (http://www.proteome.com) for the offer of the YPD full spreadsheet. This work was supported in part by Special Coordination Funds for Promoting Science and Technology from the Science and Technology Agency. KN and TT were also supported by a Grant-in-Aid for Scientific Research on Priority Areas from the Ministry of Education, Science and Culture of Japan.

References
