Prediction of functional class of novel plant proteins by a statistical learning method

Authors

  • L. Y. Han,

    1. Department of Computational Science, National University of Singapore, Blk SOC1, Level 7, 3 Science Drive 2, Singapore 117543;
  • C. J. Zheng,

    1. Department of Computational Science, National University of Singapore, Blk SOC1, Level 7, 3 Science Drive 2, Singapore 117543;
  • H. H. Lin,

    1. Department of Computational Science, National University of Singapore, Blk SOC1, Level 7, 3 Science Drive 2, Singapore 117543;
  • J. Cui,

    1. Department of Computational Science, National University of Singapore, Blk SOC1, Level 7, 3 Science Drive 2, Singapore 117543;
  • H. Li,

    1. Department of Computational Science, National University of Singapore, Blk SOC1, Level 7, 3 Science Drive 2, Singapore 117543;
  • H. L. Zhang,

    1. Department of Computational Science, National University of Singapore, Blk SOC1, Level 7, 3 Science Drive 2, Singapore 117543;
  • Z. Q. Tang,

    1. Department of Computational Science, National University of Singapore, Blk SOC1, Level 7, 3 Science Drive 2, Singapore 117543;
  • Y. Z. Chen

    Corresponding author
    1. Department of Computational Science, National University of Singapore, Blk SOC1, Level 7, 3 Science Drive 2, Singapore 117543;
    2. Shanghai Center for Bioinformation Technology, 100 Qinzhou Road, Shanghai, China 200235
      Author for correspondence: Y. Z. Chen Tel: +65 6874 6877 Fax: +65 6774 6756 Email: yzchen@cz3.nus.edu.sg


Summary

  • In plant genomes, the function of a substantial percentage of the putative protein-coding open reading frames (ORFs) is unknown. These ORFs have no significant sequence similarity to known proteins, which complicates the task of functional study of these proteins. Efforts are being made to explore methods that are complementary to, or may be used in combination with, sequence alignment and clustering methods.
  • A web-based protein functional class prediction software, SVMProt, has shown some capability for predicting functional class of distantly related proteins. Here the usefulness of SVMProt for functional study of novel plant proteins is evaluated.
  • To test SVMProt, 49 plant proteins (without a sequence homolog in the Swiss-Prot protein database, not in the SVMProt training set, and with functional indications provided in the literature) were selected from a comprehensive search of MEDLINE abstracts and Swiss-Prot databases in 1999–2004. These represent unique proteins the function of which, at present, cannot be confidently predicted by sequence alignment and clustering methods.
  • The predicted functional class of 31 proteins was consistent, and that of four other proteins was weakly consistent, with published functions. Overall, the functional class of 71.4% of these proteins was consistent, or weakly consistent, with functional indications described in the literature. SVMProt shows a certain level of ability to provide useful hints about the functions of novel plant proteins with no similarity to known proteins.

Introduction

In the completely sequenced genome of Arabidopsis, the function of 30% of the putative protein-coding open reading frames (ORFs) is unknown (Arabidopsis Genome Initiative, 2000; Cho & Walbot, 2001). A similar percentage of unknown ORFs is expected in other plant genomes. The sequence of each of these ORFs has no significant similarity to those of known proteins, and their functions are difficult to probe by using sequence alignment and clustering methods. It is thus desirable to explore complementary methods, or a combination of methods, to provide useful hints about the function of unknown ORFs.

Various methods for probing protein function have been developed. These include evolutionary analysis (Eisen, 1998; Benner et al., 2000); hidden Markov models (HMM) (Fujiwara & Asogawa, 2002); structural consideration (Di Gennaro et al., 2001; Teichmann et al., 2001); protein/gene fusion (Enright et al., 1999; Marcotte et al., 1999); protein–protein interactions (Bock & Gough, 2001); motifs (Kunin et al., 2001); family classification by sequence clustering (Enright et al., 1999); and functional family prediction by statistical learning methods (Jensen et al., 2002; Karchin et al., 2002; Cai et al., 2003, 2004; Han et al., 2004). In the absence of clear sequence or structural similarities, the criteria for comparison of distantly related proteins become increasingly difficult to formulate (Enright & Ouzounis, 2000). Moreover, not all homologous proteins have analogous functions (Benner et al., 2000). The presence of a shared domain within a group of proteins does not necessarily suggest that these proteins perform the same function (Henikoff et al., 1997). Therefore careful evaluation is needed to determine which method or combination of methods is useful for facilitating functional study of novel proteins with no homology to proteins of known function.

A web-based protein functional class prediction software, SVMProt (http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi), has recently been shown to have some potential for predicting the functional class of distantly related proteins (proteins of low homology to other proteins) and homologous proteins of different functions (proteins of high homology to proteins of different functions) (Cai et al., 2003, 2004). SVMProt classifies proteins into functional classes defined from activities or physicochemical properties rather than sequence homology (Bock & Gough, 2001; Karchin et al., 2002; Cai et al., 2003, 2004; Han et al., 2004). Many of these classes are composed of multiple homolog groups and no sequence similarity is required in SVMProt predictions.

Such a sequence similarity-independent approach is made possible by the use of sequence-derived physicochemical properties (amino acid composition, hydrophobicity, normalized Van der Waals volume, polarity, polarizability, charge, surface tension) and structural properties (secondary structure, solvent accessibility) as the basis for classification of a protein. Thus SVMProt may potentially be used to facilitate functional study of proteins without significant sequence homology to known proteins, either to complement, or in combination with, other approaches such as sequence alignment and clustering methods. In the present work, SVMProt is assessed for its ability to predict the functional class of a number of novel plant proteins that have functional indications described in the literature but no homolog among the entries of the Swiss-Prot database, as determined by a psi-blast search.

Materials and Methods

Selection of novel plant proteins

The keyword ‘novel plant protein’ was used to search two sources for plant proteins that are described as novel and have precise functional indications provided. One source was MEDLINE (Wheeler et al., 2003) abstracts published during 1999–2004. These proteins were further screened to remove those for which sequence information was not yet available in the protein databases. As the search was confined to abstracts, proteins for which no functional indication was suggested in an abstract were not selected. Thus the selected proteins probably account for only a proportion of the known novel plant proteins reported in the literature. The second source was the Swiss-Prot database (Boeckmann et al., 2003). The keyword ‘novel plant’ was used to search the description field of the plant protein entries to find those for which precise functional indications were provided. A total of 413 proteins were selected from these two search procedures.

Some of the proteins selected may have become less novel than originally described because of the subsequent discovery of additional proteins. Thus a psi-blast (Altschul et al., 1997) search was run for each protein against all entries in the Swiss-Prot protein database (Boeckmann et al., 2003) to determine whether it has a sequence homolog (including the same protein from a different species). The commonly used criterion for homologs, a similarity score e-value below the inclusion threshold of 0.005 (Altschul et al., 1997), was used in this work. Based on the psi-blast analysis, 364 of the 413 proteins searched were found either to have at least one sequence homolog, or to be in the SVMProt training sets, and thus were not sufficiently novel for testing SVMProt.
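The screening step can be sketched as follows. This is a minimal illustration, assuming the psi-blast hits for each candidate have already been parsed into a list of e-values against other database entries; the function names, data layout and toy accessions are hypothetical, and only the 0.005 inclusion threshold comes from the text.

```python
# Hypothetical sketch of the novelty screen; names and data layout are assumptions.
E_VALUE_THRESHOLD = 0.005  # psi-blast inclusion threshold quoted in the text


def has_homolog(hit_e_values, threshold=E_VALUE_THRESHOLD):
    """Return True if any database hit has an e-value below the threshold."""
    return any(e < threshold for e in hit_e_values)


def select_novel(candidates, training_set_ids):
    """Keep proteins with no homolog that are also absent from the training sets.

    candidates: dict mapping accession -> e-values of hits to other entries
                (self-hits assumed to be removed beforehand).
    training_set_ids: set of accessions already used to train the classifier.
    """
    return [
        accession
        for accession, e_values in candidates.items()
        if not has_homolog(e_values) and accession not in training_set_ids
    ]


# Toy data, invented for illustration (not taken from the study).
toy_hits = {
    "P_A": [1e-30, 0.2],  # has a homolog
    "P_B": [0.8, 3.0],    # no hit below the threshold
    "P_C": [],            # no hits at all
    "P_D": [0.01],        # no homolog, but already in the training set
}
novel = select_novel(toy_hits, training_set_ids={"P_D"})
print(novel)
```

Applied to the study's numbers, a screen of this kind removes 364 of the 413 candidates and leaves 413 − 364 = 49 proteins.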

The remaining 49 proteins had no sequence homolog among the entries of the Swiss-Prot database. They were not in the training sets of SVMProt, and were thus sufficiently novel for testing SVMProt. These proteins, along with their NCBI protein or Swiss-Prot accession numbers, functional indications described in the literature and related references, are given in Table 1. Only a few proteins published before 2001 were selected, primarily because proteins published in earlier years are more likely to have homologs released in subsequent years. Because of the lack of a sequence homolog, the function of these proteins would not be confidently predicted by sequence alignment and clustering tools if they were recently discovered. They are thus ideal for testing the feasibility of using SVMProt for facilitating functional characterization of novel plant proteins.

Table 1.  Novel plant proteins – functional indications as suggested by the literature and SVMProt-predicted functional classes

Host plant | Protein (NCBI or Swiss-Prot accession number) | Function described in the literature (reference) | SVMProt-predicted functional class (probability of correct prediction) | SVMProt prediction status
Aegilops speltoides | SPP (AAO33156) | Sucrose phosphatase (EC 3.1.3.24) (Lunn, 2003) | EC 3.1 hydrolases – acting on ester bonds (94.7%); EC 2.7 transferases – transferring phosphorus-containing groups (76.2%); Photosystem I (58.6%); TC 1.C channels/pores – pore-forming toxins (proteins and peptides) (58.6%) | C
Arabidopsis thaliana | MYB-related transcription factor EPR1 (BAC98462) | DNA-binding protein, specifically recognizes the sequence 5′-YAAC[GT]G-3′ (Kuno et al., 2003) | DNA-binding protein (98.8%) | C
 | OsGRF1 (AAM52876) | Putative transcription factor playing a regulatory role in stem elongation (Kim et al., 2003) | DNA-binding protein (97.0%) | WC
 | PSI-O (CAD37939) | Contains two transmembrane helices (Knoetzel et al., 2002) | Chlorophyll (82.2%); Transmembrane (80.4%); Photosynthesis (62.2%); Photosystem I (58.6%); TC 3.A.1 ABC family (58.6%); EC 5.3 intramolecular oxidoreductase (58.6%) | C
 | ERN1 (CAA75349) | A novel ethylene-regulated nuclear protein, putative transcription factor (Trentmann, 2000) | EC 4.2 carbon–oxygen lyase (58.6%); 7 transmembrane receptor metabotropic glutamate family (58.6%) | NC
 | CPDase (O04147) | Cyclic phosphodiesterase (EC 3.1.4) (Yamada et al., 2003) | DNA-binding proteins (71.3%); EC 3.1 hydrolases – acting on ester bonds (58.6%); Photosystem I (58.6%) | C
 | GddR precursor (Q9FPU3) | Glutathione dependent dehydroascorbate reductase (EC 1.8.5.1)* | EC 1.8 oxidoreductases – acting on a sulfur group of donors (99.0%); Photosynthesis (58.6%); Transmembrane (58.6%); TC 1.C channels/pores – pore-forming toxins (proteins and peptides) (58.6%) | C
Brassica napus | bnKCP1 (AAO53442) | Contains a putative kinase-inducible domain, may function as a transcription factor (Gao et al., 2003) | DNA-binding protein (68.5%); TC 1.C pore-forming toxins (proteins and peptides) (58.6%) | WC
Cucumis melo var. reticulatus | Cucumisin (Q940D5) | Serine protease (EC 3.4.21.25) (Yamagata et al., 2002) | EC 3.4 hydrolases – acting on peptide bonds (peptidases) (99.0%); EC 3.1 hydrolases – acting on ester bonds (78.4%) | C
Glycine max | CPP1 (CAA09028) | DNA-binding protein interacting with the promoter of the soybean leghemoglobin gene Gmlbc3 (Cvitanich et al., 2000) | No function predicted | NC
 | GmN6L (AAL86737) | Both as a soluble protein and as a peripheral membrane protein bound to the peribacteroid membrane, a late nodulin (Trevaskis et al., 2002) | EC 1.1 oxidoreductase acting on CH–OH group of donors (73.8%); EC 3.6 hydrolase acting on acid anhydrides (71.3%) | ?
Hordeum vulgare | Lem1 (AAK58425) | Possibly associated with membranes, may play a role in organ development (Skadsen et al., 2002) | Photosynthesis (58.6%); Photosystem I (58.6%); EC 3.4 peptidase (58.6%) | NC
 | HvCaBP1 (AAK92225) | Putative calcium binding protein (Jang et al., 2003) | EC 1.3 oxidoreductase acting on CH–CH group of donors (85.4%); EC 4.1 carbon–carbon lyase (62.2%); Outer membrane (58.6%) | WC
 | PM19 (AAF29532) | Putative plasma membrane protein (Ranford et al., 2002) | Transmembrane (76.2%); Chlorophyll biosynthesis (58.6%); Photosystem I (58.6%) | C
 | AOC (Q711R0) | Allene oxide cyclase precursor (EC 5.3.99.6)* | EC 5.3 isomerases – intramolecular oxidoreductases (95.7%); Transmembrane (65.4%); EC 1.10 oxidoreductases – acting on diphenols and related substances as donors (65.4%) | C
Hordeum vulgare ssp. vulgare | HvS40 (CAC36956) | A novel nucleus-targeted protein with connection to the degeneration of chloroplasts (Krupinska et al., 2002) | DNA-binding protein (78.4%); Nuclear receptor (65.4%); EC 2.1 transferase of one-carbon groups (58.6%); RNA-binding protein (58.6%) | WC
 | SnIP1 (CAB97356) | Interacts with SNF1-related protein kinase (Slocombe et al., 2002) | EC 3.4 Peptidase (71.3%); EC 5.3 intramolecular oxidoreductase (68.5%); EC 1.3 oxidoreductase acting on CH–CH group of donors (65.4%); EC 3.5 hydrolase acting on carbon–nitrogen bonds other than peptide bonds (62.2%) | NC
Hordeum vulgare var. distichum | Spp (Q84ZX7) | Sucrose phosphatase (EC 3.1.3.24) (Lunn, 2003) | EC 3.1 Hydrolases – acting on ester bonds (97.7%); EC 2.4 transferases – glycosyltransferases (91.3%); Photosystem I (58.6%) | C
Ipomoea batatas | SPA15 (AAK08655) | Specifically associated with the cell wall (Yap et al., 2003) | Outer membrane (58.6%); TC 3.A.5 type II secretory pathway family (58.6%) | C
Lilium longiflorum | LlSCL (BAC77269) | Strong activity of transcriptional activation (Morohashi et al., 2003) | No function predicted | NC
Lycopersicon esculentum | NCP1 (AAK83083) | Nuclear matrix protein, structural protein with a function both in the nucleus and cytoplasm (Rose et al., 2003) | Structural protein (99.0%); EC 5.4 intramolecular transferase (85.4%); DNA-binding proteins (65.4%); TC 3.A.3 P-type ATPase family (58.6%) | C
 | AOC (Q9LEG5) | Allene oxide cyclase precursor (EC 5.3.99.6) (Ziegler et al., 2000) | EC 5.3 isomerases – intramolecular oxidoreductases (99.0%); EC 4.1 lyases – carbon–carbon lyases (58.6%) | C
 | RdRP (Q9ZR58) | RNA-directed RNA polymerase (EC 2.7.7.48) (Schiebel et al., 1998) | EC 2.7 transferases – transferring phosphorus-containing groups (99.1%); Photosystem I (58.6%) | C
 | LeMan3 (Q9FUQ6) | Endo-β-mannanase precursor (EC 3.2.1.78)* | EC 2.4 transferases – glycosyltransferases (95.2%); Photosystem I (58.6%) | NC
 | MAN5 (Q6YM50) | Mannan endo-1,4-β-mannanase precursor (EC 3.2.1.78) (Filichkin et al., 2004) | EC 2.3 transferases – acyltransferases (58.6%); EC 2.3 transferases – acyltransferases (68.5%) | NC
Oenothera bertiana | A6L (P07513) | ATP synthase protein 8 (EC 3.6.3.14) (Boeckmann et al., 2003) | TC 3.A.1 ABC family (99.0%); EC 3.1 hydrolases – acting on ester bonds (58.6%); mRNA-binding Proteins (58.6%); TC 3.A.5 Type II secretory pathway family (58.6%) | NC
Oryza sativa | OsBLE2 (BAB88327) | Contains nine possible transmembrane regions, involved in BL-regulated growth and development processes (Yang et al., 2003) | Transmembrane (99.1%); TC 1.A alpha-type channels (58.6%); TC 1.A.1 voltage-gated ion channel family (58.6%) | C
 | OsMYBS2 (AAN63153) | Trans-activates a promoter containing the TATCCA element, interacts with other protein factors (Lu et al., 2002) | Transmembrane (62.2%) | NC
 | β-1,2-xylosyltransferase (Q703H1) | β-1,2-xylosyltransferase (EC 2.4.2.38)* | EC 2.4 transferases – glycosyltransferases (98.8%); EC 4.2 lyases – carbon–oxygen lyases (58.6%); Outer membrane (58.6%) | C
 | AOC (Q8L6H4) | Allene oxide cyclase (EC 5.3.99.6)* | EC 5.3 isomerases – intramolecular oxidoreductases (99.0%); Photosynthesis (62.2%); EC 1.10 oxidoreductases – acting on diphenols and related substances as donors (58.6%) | C
 | Aspartate aminotransferase (Q42991) | Aspartate aminotransferase (EC 2.6.1.1)* | TC 1.C channels/pores – pore-forming toxins (proteins and peptides) (58.6%); RNA-binding proteins (58.6%); TC 3.A.3 P-type ATPase family (58.6%); Photosystem I (58.6%); TC 3.A.5 type II secretory pathway family (58.6%) | NC
Petunia × hybrida | NEC1 (AAG34696) | Reminiscent of a transmembrane protein, possible role in sugar metabolism and nectar secretion (Ge et al., 2000) | Transmembrane (98.0%) | C
Phaseolus angularis | GrG (Q9SBZ0) | Galactinol-raffinose galactosyltransferase (EC 2.4.1.67) (Peterbauer et al., 1999) | EC 2.4 transferases – glycosyltransferases (96.4%); EC 4.2 lyases – carbon–oxygen lyases (78.4%) | C
Pinus sylvestris | Antimicrobial peptide 1 (AAL05052) | Interferes with cell wall synthesis (Asiegbu et al., 2003) | Plant defense (58.6%); TC 2.A.3 amino acid-polyamine-organocation (APC) family (58.6%) | NC
 | Antimicrobial peptide 2 (AAL05053) | Interferes with cell wall synthesis (Asiegbu et al., 2003) | EC 3.4 peptidase (58.6%); EC 4.1 carbon–carbon lyase (58.6%); Plant defense (58.6%); TC 2.A.3 amino acid-polyamine-organocation (APC) family (58.6%) | C
 | Antimicrobial peptide 3 (AAL05054) | Interferes with cell wall synthesis (Asiegbu et al., 2003) | EC 3.4 peptidase (58.6%); Plant defense (58.6%); TC 2.A.3 amino acid-polyamine-organocation (APC) family (58.6%) | C
 | Antimicrobial peptide 4 (AAL05055) | Interferes with cell wall synthesis (Asiegbu et al., 2003) | EC 3.4 peptidase (58.6%); EC 4.1 carbon–carbon lyase (58.6%); Plant defense (58.6%); TC 2.A.3 amino acid-polyamine-organocation (APC) family (58.6%) | C
Pisum sativum | PLATZ1 (BAB69816) | Zinc-dependent DNA-binding protein responsible for A/T-rich sequence-mediated transcriptional repression (Nagano et al., 2001) | Nuclear receptor (68.5%); EC 3.1 hydrolase acting on ester bonds (62.2%) | C
 |  |  | EC 3.1 hydrolase acting on ester bonds (62.2%); EC 4.1 carbon–carbon lyase (58.6%) | C
 | PAT1 (Q43085) | Phosphoribosylanthranilate transferase (EC 2.4.2.18)* | EC 2.4 transferases – glycosyltransferases (99.1%); Transmembrane (95.2%); EC 2.7 transferases – transferring phosphorus-containing groups (76.2%); 7 transmembrane receptor (Secretin family) (58.6%); 7 transmembrane receptor (metabotropic glutamate family) (58.6%) | C
 | rfs (Q8VWN6) | Raffinose synthase (EC 2.4.1.82)* | EC 2.4 transferases – glycosyltransferases (98.6%); Aptamer-binding protein (98.0%); EC 4.2 lyases – carbon–oxygen lyases (78.4%); TC 3.A.3 P-type ATPase family (58.6%) | C
Secale cereale | Sucrose phosphatase (Q84ZX9) | Sucrose phosphatase (EC 3.1.3.24) (Lunn, 2003) | EC 3.1 hydrolases – acting on ester bonds (86.8%); EC 2.7 transferases – transferring phosphorus-containing groups (62.2%); Photosystem I (58.6%) | C
Solanum tuberosum | CR6 (P48505) | Ubiquinol–cytochrome C reductase complex 6.7 kDa protein (EC 1.10.2.2) (Jansch et al., 1995) | EC 1.10 oxidoreductases – acting on diphenols and related substances as donors (99.0%); EC 3.4 hydrolases acting on peptide bonds (peptidases) (58.6%); Photosystem I (58.6%); TC 3.A.1 ABC family (58.6%) | C
Triticum aestivum | CPDase (P62809) | Cyclic phosphodiesterase (EC 3.1.4) (Genschik et al., 1997) | EC 1.9 oxidoreductases – acting on a heme group of donors (58.6%); EC 3.1 hydrolases acting on ester bonds (58.6%); EC 3.4 hydrolases acting on peptide bonds (peptidases) (58.6%); Photosynthesis (58.6%); Photosystem I (58.6%); Photosystem II (58.6%); TC 3.A.5 type II secretory pathway family (58.6%); Aptamer-binding protein (58.6%) | C
 | SPP3 (Q9ARG8) | Sucrose-6F-phosphate phosphohydrolase SPP3 (EC 3.1.3.24)* | EC 3.1 hydrolases – acting on ester bonds (96.4%); EC 2.7 transferases – transferring phosphorus-containing groups (92.1%); Photosystem I (58.6%); TC 1.C channels/pores – pore-forming toxins (proteins and peptides) (58.6%) | C
 | SPP2 (Q9AXK5) | Sucrose-6F-phosphate phosphohydrolase SPP2 (EC 3.1.3.24)* | EC 3.1 hydrolases – acting on ester bonds (93.6%); EC 2.7 transferases – transferring phosphorus-containing groups (68.5%); Photosystem I (58.6%); TC 1.C channels/pores – pore-forming toxins (proteins and peptides) (58.6%) | C
 | SPP1 (Q9AXK6) | Sucrose-6F-phosphate phosphohydrolase SPP1 (EC 3.1.3.24)* | EC 3.1 hydrolases – acting on ester bonds (96.4%); Photosystem I (58.6%); EC 2.1 transferases – transferring one-carbon groups (58.6%); TC 1.C channels/pores – pore-forming toxins (proteins and peptides) (58.6%) | C
 | fut12 (Q7XAG0) | GDP-fucose protein-O-fucosyltransferase 1 (EC 2.4.1.221)* | EC 2.4 transferases – glycosyltransferases (98.4%); EC 2.7 transferases – transferring phosphorus-containing groups (96.7%); Photosystem I (58.6%) | C
Vigna unguiculata | FGARAT (Q8W160) | Formylglycinamide ribonucleotide amidotransferase (EC 6.3.5.3)* | EC 2.4 transferases – glycosyltransferases (95.2%); DNA-binding proteins (73.8%); EC 2.7 transferases – transferring phosphorus-containing groups (62.2%); TC 3.A.1 ABC family (58.6%) | NC
Zea mays | ATPase (Q6V916) | Putative AAA-type ATPase (EC 3.6.4.8)* | EC 2.4 transferases – glycosyltransferases (71.3%); Photosynthesis (58.6%); Chlorophyll biosynthesis (58.6%) | NC

  1. Each SVMProt prediction is assigned one of four status categories: C, consistent with published functional indications; WC, weakly consistent with functional indications as described in the literature (consistency of predicted functional class with published function can be considered inconclusive); NC, not consistent with published functional indications; ‘?’, currently available information insufficient to determine prediction status. A blank host-plant cell means the host is the same as in the row above.

  2. *Some of these enzymes do not yet have a reference because they were submitted to the Swiss-Prot database before publication (Boeckmann et al., 2003).

Computational method

SVMProt is based on a statistical learning method, support vector machines (SVM: Vapnik, 1995; Burges, 1998; Muller et al., 2001; Scholkopf & Smola, 2002). In addition to the prediction of protein functional class (Jaakkola et al., 2000; Karchin et al., 2002; Tsuda et al., 2002; Cai et al., 2003, 2004; Han et al., 2004), SVM has also been used for a variety of protein classification problems including fold recognition (Ding & Dubchak, 2001); analysis of solvent accessibility (Yuan et al., 2002); prediction of secondary structures (Hua & Sun, 2001); and protein–protein interactions (Bock & Gough, 2001). As a method that uses sequence-derived physicochemical properties of proteins as the basis for classification, SVM may be particularly useful for functional classification of distantly related proteins and homologous proteins of different functions (Cai et al., 2003, 2004). This feature makes SVM a potentially attractive method for the functional study of proteins, either to complement, or to be used in combination with, conventional procedures such as sequence alignment and clustering methods.

Protein functional classes covered by SVMProt

As shown in the supplementary material (Appendix S1, available online), 97 protein functional classes are currently covered by SVMProt. Some of these classes are associated with plant-specific functional roles; others are based on hierarchical biochemical classification schemes such as the enzyme classification system recommended by the International Union of Biochemistry and Molecular Biology (IUBMB) enzyme nomenclature committee and the transporter classification system. The plant-related functional classes of SVMProt include photosynthesis, photorespiration, photoreceptors, photosystems, chlorophyll, chlorophyll biosynthesis, plant defense and herbicide resistance. Additional plant-related classes will soon be added to SVMProt. Examples of these classes are phytochrome, fruit and ripening, nodulation and seed-storage proteins.

The biochemical classes of SVMProt include 46 enzyme families, nine channel/transporter families, 21 transporter families, four RNA-binding protein families, DNA-binding proteins, five G-protein-coupled receptors, nuclear receptors, tyrosine-receptor kinases, cell-adhesion proteins, coat proteins, envelope proteins, outer-membrane proteins, structural proteins and growth factors. Two broadly defined families of antigens and transmembrane proteins are also included. The majority of the known types of plant protein are included in these biochemical classes.

Some of these classes may be too broad to be specifically informative for the study of protein functions. Thus functional classes defined from the Gene Ontology (GO) system (Harris et al., 2004) may be more suitable for constructing SVMProt prediction systems. Many of the GO classes contain a relatively small number of proteins, which causes imbalance between the number of class members and that of nonmembers of some classes. An unbalanced data set is known to reduce the accuracy of SVM classification. Because of the intrinsic weakness of SVM algorithms in dealing with GO classes composed of a small number of proteins, it is not yet feasible to develop accurate SVMProt systems for these GO classes. Thus the current version of SVMProt is primarily based on the hierarchical biochemical classes. Efforts are being made to develop SVM algorithms for unbalanced data (Kim & Park, 2004): progress in this will make it possible to develop comprehensive SVM-prediction systems based on GO classes.

Training and validation

SVMProt has been trained and tested by using a large number of proteins in each of the 97 functional classes and in 7316 Pfam curated protein families (Cai et al., 2003, 2004; Han et al., 2004). A training set contains positive proteins (those in a functional class) and negative proteins (those outside a class). The negative proteins of a class are from representative proteins of the Pfam families without a member in that class (Cai et al., 2003, 2004; Han et al., 2004). A training set needs to be both diverse and kept as small as possible, to ensure adequate representation and reduce unnecessary noise generated from data redundancy. Thus a number of training sets are generated by using proteins randomly selected from the respective positive and negative protein pools at different percentages. The training set and the corresponding classification system with the optimum testing accuracy are used in SVMProt.
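The training-set selection described above can be sketched roughly as follows. This is an illustrative outline only: the function names are hypothetical, `train_and_score` stands in for whatever routine trains a classifier and reports its testing accuracy, and using a single sampling fraction for both pools is a simplifying assumption.

```python
import random


def sample_training_set(positives, negatives, fraction, rng):
    """Randomly draw the given fraction of proteins from each pool."""
    n_pos = max(1, round(len(positives) * fraction))
    n_neg = max(1, round(len(negatives) * fraction))
    return rng.sample(positives, n_pos), rng.sample(negatives, n_neg)


def pick_best_training_set(positives, negatives, fractions, train_and_score, seed=0):
    """Try several sampling fractions and keep the set with the best test accuracy.

    train_and_score is a caller-supplied (hypothetical) function that trains a
    classifier on the sampled sets and returns its accuracy on held-out data.
    Returns a tuple (accuracy, fraction, sampled_positives, sampled_negatives).
    """
    rng = random.Random(seed)
    best = None
    for fraction in fractions:
        pos, neg = sample_training_set(positives, negatives, fraction, rng)
        accuracy = train_and_score(pos, neg)
        if best is None or accuracy > best[0]:
            best = (accuracy, fraction, pos, neg)
    return best


# Toy usage: the scorer below is a stand-in that simply prefers about 30 positives.
toy_positives = list(range(100))
toy_negatives = list(range(100, 400))
toy_scorer = lambda pos, neg: 1.0 - abs(len(pos) - 30) / 100
best = pick_best_training_set(toy_positives, toy_negatives, [0.1, 0.3, 0.5], toy_scorer)
print(best[0], best[1], len(best[2]), len(best[3]))
```

In practice the scorer would train an SVM on the sampled sets and evaluate it on a separate testing set.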

The numbers of member and nonmember protein sequences used to train each of the functional classes were in the range 14–3892 and 513–7299, respectively; those of the independent evaluation sets were in the range 7–4841 and 986–7291, respectively. Examples of the training sets are 945 member and 1896 nonmember sequences for glycosyltransferases (EC2.4); 3892 member and 5324 nonmember sequences for transferases of phosphorus-containing groups (EC2.7); 461 member and 1122 nonmember sequences for intramolecular oxidoreductases (EC5.3); and 1054 member and 1914 nonmember sequences for photosynthesis proteins. Examples of the independent evaluation sets are 288 member and 4926 nonmember sequences for glycosyltransferases (EC2.4); 3016 member and 5707 nonmember sequences for transferases of phosphorus-containing groups (EC2.7); 178 member and 5009 nonmember sequences for intramolecular oxidoreductases (EC5.3); and 657 member and 6796 nonmember sequences for photosynthesis. The numbers of sequences in other classes can be found in the supplementary material (Appendix S1).

With the exception of those of plant-specific functional classes, training sets of biochemical classes in SVMProt were constructed using proteins from multiple organisms, not just plants. As proteins in each of these biochemical classes share common functionally relevant structural and physicochemical characteristics at the active sites, irrespective of host species, the corresponding SVMProt prediction systems developed from these multispecies proteins are expected to be useful for facilitating functional studies of plant proteins as well as proteins from other species.

SVMProt is trained for protein classification in the following manner. First, every protein sequence is represented by a specific feature vector x_i with the following structural and physicochemical properties as its components: amino acid composition, hydrophobicity, normalized Van der Waals volume, polarity, polarizability, charge, surface tension, secondary structure and solvent accessibility for each residue in the sequence (Cai et al., 2003). A detailed description of construction of the feature vector from a protein sequence is provided in the supplementary material (Appendix S2, available online).
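To make the idea of a sequence-derived feature vector concrete, the sketch below computes two simple components: the amino acid composition and the fraction of residues in each of three hydrophobicity groups. It is not the SVMProt descriptor set, which is defined in Appendix S2; the three-group partition and all names here are assumptions chosen for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumed three-group hydrophobicity partition (polar / neutral / hydrophobic).
# Illustrative only; the actual descriptors are defined in Appendix S2.
HYDROPHOBICITY_GROUPS = {
    "polar": set("RKEDQN"),
    "neutral": set("GASTPHY"),
    "hydrophobic": set("CLVIMFW"),
}


def amino_acid_composition(sequence):
    """Fraction of each of the 20 standard amino acids in the sequence."""
    return [sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS]


def group_composition(sequence, groups):
    """Fraction of residues falling in each property group."""
    return [
        sum(residue in members for residue in sequence) / len(sequence)
        for members in groups.values()
    ]


def feature_vector(sequence):
    """Concatenate the composition descriptors into one fixed-length vector."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence must not be empty")
    return amino_acid_composition(sequence) + group_composition(
        sequence, HYDROPHOBICITY_GROUPS
    )


x = feature_vector("MKVLAAGIW")  # toy peptide, not a real protein
print(len(x))
```

The point of such descriptors is that the vector has the same length for every protein regardless of sequence length, so proteins can be compared without alignment.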

Samples of proteins of a particular class (positive samples) and those that do not belong to this class (negative samples) are then used as the training set. The feature vectors x_i of these positive and negative samples are projected into a high-dimensional feature space using a kernel function K(x_i, x_j), where a hyperplane can be constructed to separate the positive and negative samples (Vapnik, 1995). This hyperplane in the higher-dimensional space can be used for classifying an unknown protein into either the positive group (protein is predicted to be a member of the class) or the negative group (protein is predicted to be a nonmember of the class) on the basis of which side of the hyperplane the feature vector of this protein is located.
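The classification step can be illustrated with the standard kernel SVM decision rule, in which the sign of a weighted sum of kernel values gives the side of the hyperplane. The sketch below assumes a Gaussian kernel and uses invented support vectors, weights and bias; none of these values come from SVMProt.

```python
import math


def gaussian_kernel(u, v, sigma=1.0):
    """K(u, v) = exp(-||u - v||^2 / (2 * sigma^2)); an assumed kernel choice."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def decision_value(x, support_vectors, labels, alphas, bias, sigma=1.0):
    """Signed score of x relative to the separating hyperplane in feature space."""
    return bias + sum(
        alpha * y * gaussian_kernel(sv, x, sigma)
        for sv, y, alpha in zip(support_vectors, labels, alphas)
    )


def classify(x, support_vectors, labels, alphas, bias, sigma=1.0):
    """Return +1 (predicted class member) or -1 (predicted nonmember)."""
    score = decision_value(x, support_vectors, labels, alphas, bias, sigma)
    return 1 if score >= 0 else -1


# Toy model with one positive and one negative support vector (invented values).
toy_svs = [(0.0, 0.0), (2.0, 2.0)]
toy_labels = [1, -1]
toy_alphas = [1.0, 1.0]
toy_bias = 0.0

near_positive = classify((0.1, 0.1), toy_svs, toy_labels, toy_alphas, toy_bias)
near_negative = classify((1.9, 2.0), toy_svs, toy_labels, toy_alphas, toy_bias)
print(near_positive, near_negative)
```

The magnitude of the decision value reflects the distance from the hyperplane, which is the quantity a reliability score can be built on.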

Scoring of SVM classification of proteins has been estimated by using a reliability index R value (Hua & Sun, 2001). There is a statistical correlation between R value and expected classification accuracy (probability of correct classification) (Hua & Sun, 2001). Thus another quantity, P value, is introduced to indicate the expected classification accuracy. The P value is derived from the statistical relationship between the R value and actual classification accuracy based on the analysis of 9932 positive and 45 999 negative samples of proteins (Cai et al., 2003).
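One way to picture the relationship between R value and P value is as an empirical lookup table: group validated predictions by R value and record the fraction that were correct in each group. The sketch below assumes R values are already available as integers; the function name and the toy records are hypothetical and do not reproduce the published calibration.

```python
from collections import defaultdict


def accuracy_by_r_value(records):
    """Map each reliability index R to the observed fraction of correct predictions.

    records: iterable of (r_value, was_correct) pairs from a validation set.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r_value, was_correct in records:
        totals[r_value] += 1
        correct[r_value] += int(was_correct)
    return {r: correct[r] / totals[r] for r in sorted(totals)}


# Toy validation records, invented for illustration.
toy_records = [(1, True), (1, False), (5, True), (5, True), (5, False), (10, True)]
p_table = accuracy_by_r_value(toy_records)
print(p_table)
```

A new prediction with a given R value would then be reported with the corresponding tabulated accuracy as its P value.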

The prediction accuracy for members of a functional class is > 70% for 77 classes, and 45–70% for the remaining 20 classes. The accuracy for nonmembers of a class is 82–100% for all classes. These accuracies are derived from independent sets of proteins, which have been found to be roughly similar to those obtained by using a tenfold cross-validation study (Cai et al., 2004). For instance, the computed accuracies for members and nonmembers of the enzyme EC1.2 family are 79 and 98% using an independent set, and 77 and 99% using tenfold cross-validation. These computed accuracies are mostly comparable with those from other SVM studies of proteins (Bock & Gough, 2001; Ding & Dubchak, 2001; Hua & Sun, 2001; Yuan et al., 2002). Thus both the coverage and prediction accuracy of SVMProt have reached a meaningful level for probing the functional class of novel plant proteins.
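The two accuracies quoted above can be computed separately for members and nonmembers, as in the following sketch. It assumes that accuracy for members means the fraction of true members predicted as members, and likewise for nonmembers; the function name and toy labels are illustrative.

```python
def member_nonmember_accuracy(true_labels, predicted_labels):
    """Return (accuracy for members, accuracy for nonmembers).

    Labels are +1 for class members and -1 for nonmembers.
    """
    tp = fn = tn = fp = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == 1:
            if guess == 1:
                tp += 1  # member correctly predicted as member
            else:
                fn += 1  # member missed
        else:
            if guess == -1:
                tn += 1  # nonmember correctly rejected
            else:
                fp += 1  # nonmember wrongly accepted
    return tp / (tp + fn), tn / (tn + fp)


# Toy evaluation set: 4 members and 5 nonmembers (invented labels).
toy_truth = [1, 1, 1, 1, -1, -1, -1, -1, -1]
toy_guess = [1, 1, 1, -1, -1, -1, -1, -1, 1]
member_acc, nonmember_acc = member_nonmember_accuracy(toy_truth, toy_guess)
print(member_acc, nonmember_acc)
```

Reporting the two figures separately matters because nonmembers greatly outnumber members in most classes, so a single overall accuracy would be dominated by the nonmember result.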

Not all the SVMProt classes are at the same hierarchical level. These classes are mixtures of subfamilies, families and superfamilies. Some classes, such as antigen, need to be more clearly defined into specific subclasses. While it is desirable to define all the classes at the same level, this is not yet possible because of insufficient data for the subhierarchies of some families and superfamilies. Efforts are being made to collect sufficient data so that SVMProt classification systems can be constructed on the basis of a more evenly distributed family structure. Nonetheless, predictions on the basis of the current structures provide useful hints about the functions of proteins.

Results and Discussion

SVMProt prediction results

Table 1 gives SVMProt-ascribed functional classes for each of the 49 novel plant proteins, together with their functional indications as described in the literature. SVMProt may predict more than one functional class for a protein; all the predicted classes are listed in Table 1. The probability of correct prediction for each class is also given in Table 1. There are 31 proteins whose SVMProt-predicted class is consistent with functional indications described in the literature, 20 of which are enzymes with their enzyme classification (EC) number assigned in the literature. The predicted functional class of these enzymes can thus be confirmed by comparison with their respective EC numbers. These enzymes are: SPP of Aegilops speltoides (Lunn, 2003); CPDase (Yamada et al., 2003) and GddR of Arabidopsis thaliana; cucumisin of Cucumis melo var. reticulatus (Yamagata et al., 2002); AOC of Hordeum vulgare; Spp of H. vulgare var. distichum (Lunn, 2003); AOC (Ziegler et al., 2000) and RdRP (Schiebel et al., 1998) of Lycopersicon esculentum; β-1,2-xylosyltransferase and AOC of Oryza sativa; GrG of Phaseolus angularis (Peterbauer et al., 1999); PAT1 and rfs of Pisum sativum; sucrose phosphatase of Secale cereale (Lunn, 2003); CR6 of Solanum tuberosum (Jansch et al., 1995); and CPDase (Genschik et al., 1997), SPP1, SPP2, SPP3 and fut12 of Triticum aestivum. Some of these enzymes do not yet have a reference because they were submitted to the Swiss-Prot database before publication (Boeckmann et al., 2003).
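The EC-number comparison used for these enzymes can be sketched as a simple prefix check. The snippet below assumes that a prediction counts as consistent when any predicted class matches the first two fields of the published EC number; the function names are hypothetical, and the example inputs are taken from the corresponding rows of Table 1.

```python
def ec_prefix(ec_number, depth=2):
    """First `depth` fields of an EC number, e.g. '3.1.3.24' -> '3.1'."""
    return ".".join(ec_number.split(".")[:depth])


def prediction_matches_ec(published_ec, predicted_classes):
    """True if any predicted class label equals the published EC sub-class."""
    target = "EC " + ec_prefix(published_ec)
    return any(label == target for label in predicted_classes)


# SPP of Aegilops speltoides: published EC 3.1.3.24.
spp_match = prediction_matches_ec(
    "3.1.3.24", ["EC 3.1", "EC 2.7", "Photosystem I", "TC 1.C"]
)
# LeMan3 of Lycopersicon esculentum: published EC 3.2.1.78.
leman3_match = prediction_matches_ec("3.2.1.78", ["EC 2.4", "Photosystem I"])
print(spp_match, leman3_match)
```

Matching at the two-field level reflects the granularity of the SVMProt enzyme classes, which are defined at the EC sub-class level (for example EC 3.1) rather than by full four-field EC numbers.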

Four proteins are predicted by SVMProt to be transmembrane, and one as a DNA-binding protein, which can be directly compared with their respective functional indications as described in the literature. PSI-O of A. thaliana is known to have two transmembrane helices (Knoetzel et al., 2002). PM19 of H. vulgare has been described as a putative plasma membrane protein (Ranford et al., 2002). OsBLE2 of O. sativa has been suggested to contain nine possible transmembrane regions (Yang et al., 2003). NEC1 of Petunia ×hybrida has been found to be reminiscent of a transmembrane protein with possible roles in sugar metabolism and nectar secretion (Ge et al., 2000). MYB-related transcription factor EPR1 of A. thaliana is part of a regulatory feedback loop that suppresses its own expression, and is known to specifically recognize the DNA sequence 5′-YAAC[GT]G-3′ (Kuno et al., 2003). Thus the SVMProt-predicted transmembrane or DNA-binding property for each of these proteins appears to be consistent with the descriptions in the literature.

The predicted functional class of the other four proteins also appears to be consistent with published functional indications based on our analysis. NCP1 of L. esculentum has been described as a nuclear matrix protein and a candidate for a plant-specific structural protein with a function in both the nucleus and cytoplasm (Rose et al., 2003). The top hit of SVMProt-predicted functional classes for this protein is the structural protein class, which includes matrix proteins, core proteins, viral occlusion bodies and keratins. This prediction is consistent with the function described in the literature. Antimicrobial peptides 2, 3 and 4 of Pinus sylvestris are known to interfere with cell-wall synthesis (Asiegbu et al., 2003). The top hit of SVMProt-predicted class for each of these proteins is the EC 3.4 peptidase enzyme family. It is known that members of the peptidase family, such as penicillin-binding protein 5 (EC 3.4.16.4), polymerize and modify peptidoglycan, the stress-bearing component of the bacterial cell wall, thereby helping to create the morphology of the peptidoglycan exoskeleton together with cytoskeleton proteins that regulate septum formation and cell shape (Popham & Young, 2003). While other mechanisms cannot yet be ruled out, EC 3.4 peptidase enzymatic activity is an interesting possible explanation for the observed interference of each of these proteins with cell-wall synthesis.

There are four proteins for which SVMProt-predicted function may provide an inconclusive explanation for the functional indications described in the literature. The predicted functional class of each of these proteins is thus considered to be weakly consistent with literature descriptions pending further studies. PLATZ1 of Pisum sativum has been found to be responsible for A/T-rich sequence-mediated transcriptional repression (Nagano et al., 2001). The top-ranked SVMProt-predicted class for this protein is the nuclear receptor class. Nuclear receptors such as the thyroid hormone T3 receptor are known to be involved in transcriptional repression (Tomita et al., 2004). Thus there is a possibility that PLATZ1 is a nuclear receptor. SPA15 of Ipomoea batatas has been found to be specifically associated with the cell wall and involved in oligogalacturonide signaling during leaf senescence (Yap et al., 2003). SVMProt predicts this protein as an outer-membrane protein, which may possess both properties.

Three of these proteins are predicted by SVMProt as DNA-binding proteins. OsGRF1 of A. thaliana has been described as a putative transcription factor possibly playing a regulatory role in stem elongation (Kim et al., 2003). bnKCP1 of Brassica napus contains a putative kinase-inducible domain and may function as a transcription factor (Gao et al., 2003). Transcription factors primarily exert their function through DNA binding (Ulker & Somssich, 2004), thus these two proteins are probably DNA-binding proteins. HvS40 of H. vulgare ssp. vulgare has been described as a novel nucleus-targeted protein (Krupinska et al., 2002). The nuclear HvS40 protein belongs to the group of nuclear proteins that possess two putative nuclear localization sequences (NLS), one belonging to the SV40 class, the other to the class of bipartite NLSs. In the case of the maize transcription factor opaque 2, the bipartite NLS has an additional function in DNA binding (Krupinska et al., 2002). Although there is no other evidence, it is possible that HvS40 of H. vulgare ssp. vulgare is a DNA-binding protein, like other bipartite NLS-containing proteins such as the maize transcription factor opaque 2.

Another protein, HvCaBP1 of H. vulgare, has been described as a putative calcium-binding protein (Jang et al., 2003). One of the SVMProt-predicted classes for this protein is the outer-membrane class. It is known that some outer-membrane proteins, such as the 40 kDa outer-membrane protein, form spheroplasts at a high rate in an isotonic medium in the presence of calcium, and the calcium–protein complex helps maintain the structural integrity of the cell wall (Tada & Yamaguchi, 1994). Thus there is a possibility that HvCaBP1 is a calcium-binding outer-membrane protein.

Overall, SVMProt-characterized functions of 71.4% of the 49 novel plant proteins studied in this work were found to be consistent, or weakly consistent, with the functional indications described in the literature. Because none of these proteins has a homolog in the Swiss-Prot database according to a PSI-BLAST search, our study suggests that SVMProt has a certain level of capability for probing the functional class of novel proteins with no or low homology to known proteins, and that this capability is not based on sequence similarity or clustering.

It is of interest to examine whether additional information encoded in the sequence of these 49 novel plant proteins is useful to facilitate functional prediction. Twenty-four proteins were found to contain sequence signatures that matched the sequence profiles of specific domains described in the profile-HMM library of the Pfam database (Bateman et al., 2004). From the described functions of these domains, tentative functions can be assigned on the basis of the matched sequence profiles. The assigned functions of 21 proteins, representing about 43% of the 49 novel plant proteins, are found to be consistent or weakly consistent with the published functional indications. These include five proteins for which function is either incorrectly assigned or unassigned by SVMProt. Therefore it is useful to explore multiple methods for the functional study of novel plant proteins.
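The proportions quoted in this section can be re-derived from the counts given in the text. The short Python sketch below does so, and also estimates how many proteins are supported by at least one of the two methods. That combined figure is not reported in the text: it rests on the assumption that the five proteins recovered only through Pfam matches are not among those already counted as consistent or weakly consistent for SVMProt. All variable names are illustrative.

```python
# Counts taken from the text; variable names are illustrative only.
TOTAL_PROTEINS = 49
svmprot_consistent = 31   # SVMProt class consistent with the literature
svmprot_weak = 4          # SVMProt class weakly consistent with the literature
pfam_supported = 21       # Pfam-based assignment consistent or weakly consistent
pfam_only = 5             # of those, incorrectly assigned or unassigned by SVMProt


def percent(count, total=TOTAL_PROTEINS):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


svmprot_supported = svmprot_consistent + svmprot_weak

# Assumption (not stated in the text): the Pfam-only proteins do not overlap
# with the SVMProt-supported ones, so the combined count is a simple sum.
combined_supported = svmprot_supported + pfam_only

svmprot_pct = percent(svmprot_supported)
pfam_pct = percent(pfam_supported)
combined_pct = percent(combined_supported)

print(f"SVMProt:       {svmprot_supported}/{TOTAL_PROTEINS} = {svmprot_pct}%")
print(f"Pfam:          {pfam_supported}/{TOTAL_PROTEINS} = {pfam_pct}%")
print(f"Either method: {combined_supported}/{TOTAL_PROTEINS} = {combined_pct}%")
```

Under the stated assumption, roughly four in five of the 49 proteins would receive a supported functional indication from at least one method, which illustrates why combining methods may be worthwhile.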

Factors affecting the performance of SVMProt

Several factors may affect the accuracy of SVMProt for functional characterization of novel plant proteins. One is the diversity of protein samples used for training SVMProt. It is likely that not all possible types of protein, particularly those of distantly related members, are adequately represented in some protein classes. This can be improved with the availability of more protein data. Not all distantly related proteins with the same function have similar structural and chemical features. There are cases in which different functional groups, unconserved with respect to position in the primary sequence, mediate the same mechanistic role because of flexibility at the active site (Todd et al., 2002). This plasticity is unlikely to be described sufficiently by the physicochemical descriptors currently used in SVMProt. Therefore SVMProt in its present form is not expected to be capable of classifying these types of distantly related enzyme.

Some of the SVMProt functional classes are at the level of families and superfamilies, which may include a broad spectrum of proteins. It has been shown that SVM does not work as well as HMM for distinguishing proteins in a superfamily, but may be more accurate for subfamily discrimination (Karchin et al., 2002). Thus the use of some large families and superfamilies as the basis for classification may affect the predictive accuracy of SVMProt to some extent.

Some functional classes show high overall accuracy (≈ 90%) but a relatively modest true positive (TP) to false negative (FN) ratio (TP : FN < 100 : 37). SVMProt generally gives an accurate prediction of true negatives (TN) for these classes. The imbalance in the number of members and nonmembers of each class probably contributes to the high overall accuracy with a modest TP : FN ratio. Examination of FN proteins in these classes shows that many of these proteins either belong to more than one class, or contain a domain shared by proteins in another class. These proteins are often classified into the related classes. An analysis of a broad range of classes indicates that a substantial portion (61.3%) of incorrectly classified proteins are of low sequence similarity to most of the other members of their families: the sequence similarity score E value of each protein against most members of its family is significantly greater than 0.05. The percentage of low sequence-similarity proteins in a class is not expected to be very high. Therefore the modest TP : FN ratio of these classes probably results from inadequate representation of the low-homology proteins in the training set, as well as intrinsic difficulty in classifying them.
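The effect of class imbalance on these measures can be illustrated with a small calculation. The Python sketch below computes overall accuracy, sensitivity and specificity from the counts of true positives (TP), false negatives (FN), true negatives (TN) and false positives (FP). The counts are hypothetical, chosen only so that TP : FN equals 100 : 37 and nonmembers greatly outnumber members; they are not SVMProt results.

```python
def confusion_metrics(tp, fn, tn, fp):
    """Compute accuracy, sensitivity and specificity from confusion counts."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,     # fraction of all proteins classified correctly
        "sensitivity": tp / (tp + fn),     # fraction of class members recovered
        "specificity": tn / (tn + fp),     # fraction of nonmembers rejected
    }


# Hypothetical counts for one class: 137 members, 1250 nonmembers.
example = confusion_metrics(tp=100, fn=37, tn=1150, fp=100)

for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

With these illustrative counts the overall accuracy is about 90% even though only about 73% of the class members are recovered, because the many correctly rejected nonmembers dominate the total.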

SVMProt prediction may be further improved by using protein subfamilies as the basis of classification, a more comprehensive set of protein samples, and more refined protein descriptors. The SVMProt optimization procedure and the feature vector selection algorithm may also be improved by adding additional constraints, and by incorporating independent component analysis and kernel principal component analysis (Scholkopf et al., 1998) in the preprocessing steps.

SVMProt may predict more than one functional class for a protein. While a scoring function has been introduced to rank the predicted classes with a certain level of success (Cai et al., 2003, 2004; Han et al., 2004), there is a need for a more accurate method of choosing between these predicted classes. The same need arises in determining the function of proteins of similar sequences (Eisen, 1998; Benner et al., 2000). A promising approach is to use phylogenetic profiles to guide the selection of functionally linked proteins (Pellegrini et al., 1999; Benner et al., 2000). This approach is applicable to proteins with enough homologs to construct their phylogenetic profiles. Although it is not yet applicable to the novel proteins studied here, because of the lack of a homolog for each protein, this approach can enhance the general prediction accuracy of SVMProt. After SVM classification of a protein, multiple sequence alignment is conducted against all the representative proteins of all of the predicted functional classes. The alignment patterns are then used to generate a phylogenetic tree, which is used to guide the final selection of the functional class of this protein (Gatesy et al., 1993; Pellegrini et al., 1999). This feature is being incorporated into SVMProt.
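The ranking step described above can be sketched in a few lines of Python. This is a minimal illustration, not the SVMProt scoring function: the class names, the scores and the margin used to flag a close call are all hypothetical, and the helper names are our own.

```python
def rank_predicted_classes(class_scores):
    """Return (class, score) pairs sorted from highest to lowest score."""
    return sorted(class_scores.items(), key=lambda item: item[1], reverse=True)


def is_ambiguous(ranked, margin=0.05):
    """Flag a prediction whose two best scores differ by less than margin."""
    if len(ranked) < 2:
        return False
    return (ranked[0][1] - ranked[1][1]) < margin


# Hypothetical scores for one protein (not taken from Table 1).
predictions = {
    "Transmembrane": 0.62,
    "Structural protein": 0.85,
    "DNA-binding": 0.60,
}

ranked = rank_predicted_classes(predictions)
top_class, top_score = ranked[0]

print("Top hit:", top_class, top_score)
print("Needs further evidence:", is_ambiguous(ranked))
```

A prediction flagged as ambiguous in this way is the kind of case in which additional evidence, such as the phylogenetic analysis outlined above, would be needed to choose between classes.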

Concluding remarks

SVMProt shows a certain level of capability in the functional characterization of a number of novel plant proteins, particularly those without sequence homology to proteins of known function. This suggests that SVMProt is potentially useful to facilitate the functional study of distantly related proteins in plants as well as other organisms. Further improvements in protein functional family coverage, protein sample collections and the SVM algorithm may enable the development of SVMProt into a practical tool for facilitating functional studies of unknown ORFs in the genomes of plants and other organisms.

Acknowledgements

This work was supported in part by grants from Shanghai Commission for Science and Technology (04DZ19850, 04QMX1450, 04DZ14005), and the ‘973’ National Key Basic Research Program of China (2004CB720103, 2004CB715901).
