An analysis of the Candida albicans genome database for soluble secreted proteins using computer-based prediction algorithms


  • Samuel A. Lee,

    Corresponding author
    1. Infectious Diseases Section, Department of Medicine, Yale University School of Medicine, New Haven, CT, USA
    2. Infectious Diseases Section, Department of Medicine, VA Connecticut Healthcare System, West Haven, CT, USA
    • Infectious Diseases Section, VA Connecticut Healthcare System, 950 Campbell Avenue, Building 8 (111-I), West Haven, CT 06516, USA.
    • These authors contributed equally to this work.

  • Steven Wormsley,

    1. Infectious Diseases Section, Department of Medicine, Yale University School of Medicine, New Haven, CT, USA
    • These authors contributed equally to this work.

  • Sophien Kamoun,

    1. Department of Plant Pathology, The Ohio State University, Ohio Agricultural Research and Development Center, Wooster, OH, USA
  • Austin F. S. Lee,

    1. Department of Mathematics and Statistics, Boston University, Boston, MA, USA
    2. Center for Health Quality, Outcomes, and Economic Research, Bedford VA Hospital, Bedford, MA, USA
  • Keith Joiner,

    1. Infectious Diseases Section, Department of Medicine, Yale University School of Medicine, New Haven, CT, USA
  • Brian Wong

    1. Infectious Diseases Section, Department of Medicine, Yale University School of Medicine, New Haven, CT, USA
    2. Infectious Diseases Section, Department of Medicine, VA Connecticut Healthcare System, West Haven, CT, USA


We sought to identify all genes in the Candida albicans genome database whose deduced proteins would likely be soluble secreted proteins (the secretome). While certain C. albicans secretory proteins have been studied in detail, more data on the entire secretome are needed. One approach to rapidly predict the functions of an entire proteome is to utilize genomic database information and prediction algorithms. Thus, we used a set of prediction algorithms to computationally define a potential C. albicans secretome. We first assembled a validation set of 47 C. albicans proteins that are known to be secreted and 47 that are known not to be secreted. The presence or absence of an N-terminal signal peptide was correctly predicted by SignalP version 2.0 in 47 of 47 known secreted proteins and in 47 of 47 known non-secreted proteins. When all 6165 C. albicans ORFs from CandidaDB were analysed with SignalP, 495 ORFs were predicted to encode proteins with N-terminal signal peptides. In the set of 495 deduced proteins with N-terminal signal peptides, 350 were predicted to have no transmembrane domains (or a single transmembrane domain at the extreme N-terminus) and 300 of these were predicted not to be GPI-anchored. TargetP was used to eliminate proteins with mitochondrial targeting signals, and the final computationally-predicted C. albicans secretome was estimated to consist of up to 283 ORFs. The C. albicans secretome database is available online. Copyright © 2003 John Wiley & Sons, Ltd.


The prevalence of invasive candidiasis has increased dramatically. Candida spp. have become the fourth most commonly isolated microorganism from the bloodstream of hospitalized patients in the USA and sixth most common nosocomial pathogen overall (Emori and Gaynes, 1993; Jarvis, 1995). Although Candida albicans is an increasingly important opportunistic pathogen, an incomplete understanding of Candida pathogenesis and cell biology has limited our ability to diagnose and treat candidiasis.

C. albicans has a diploid genome and has no clearly defined sexual cycle (Hull et al., 2000; Magee and Magee, 2000). Consequently, classical genetic approaches have been of limited value for studying this organism. Recent application of molecular genetic techniques in the analysis of medically important fungi has significantly enhanced fungal pathogenesis research. Important developments in the study of C. albicans biology and pathogenesis include the cloning and sequencing of many individual genes, development of integrative and episomal DNA transformation systems (De Backer et al., 2000), chromosomal mapping (Tait et al., 1997) and the near completion of a genome sequencing project (Magee, 1998; Scherer and Magee, 1990). The C. albicans genome sequencing project at Stanford University (Tzung et al., 2001) has already identified >6000 partial and complete C. albicans genes. Based on annotation information from 6165 ORFs in CandidaDB, approximately 3400 of these C. albicans genes are structural homologues of known genes from Saccharomyces cerevisiae; however, the functions of most of the remaining 2700 genes or gene fragments are unknown. Thus, although our knowledge of C. albicans genome structure is growing rapidly, our challenge now is to utilize this information to understand the functional significance of these genes, particularly in relation to C. albicans biology and pathogenesis.

Numerous algorithms for prediction of protein structure and function are available either as computer applications or as Internet-based programs, and several have been used for preliminary functional analyses of large sets of predicted proteins. Recent analyses of entire yeast genome databases have included identification of GPI-anchored proteins in S. cerevisiae (Caro et al., 1997), a comprehensive BLAST analysis of C. albicans homologues of S. cerevisiae sexual cycle genes (Tzung et al., 2001) and a prediction of the subcellular localization of the entire S. cerevisiae proteome (Kumar et al., 2002). Thus, one approach to rapidly analyse an entire genome is to utilize database information and computer-based algorithms to predict structure and/or function (Tjalsma et al., 2000; Kamoun et al., 2001).

In C. albicans, as in other eukaryotes, proteins are typically targeted for entry into the general secretory pathway by the presence of an N-terminal signal sequence. Signal sequences have a tripartite structure characterized by a central hydrophobic core (h-region), usually consisting of 6–15 amino acid (aa) residues, which is flanked by hydrophilic N- and C-terminal regions (Martoglio and Dobberstein, 1998). The h-region is important for correct targeting and membrane insertion of the peptide. The polar C-terminal region often contains helix-breaking proline and glycine residues and small uncharged residues at the −3 and −1 positions, which determine the signal peptide cleavage site (von Heijne, 1990). The polar N region is variable in length and frequently is positively charged. Although some proteins lacking N-terminal signal sequences reach the extracellular space, the majority of soluble secreted proteins in C. albicans are likely to be transported via the general secretory pathway. Therefore, we took advantage of SignalP version (v)2.0, a program that accurately identifies eukaryotic signal peptides (Nielsen et al., 1997, 1999; Nielsen and Krogh, 1998), and other predictive algorithms to define a computational secretome of C. albicans from the genome sequences.


We reasoned that soluble secreted proteins should have the following characteristics: (a) an N-terminal signal peptide; (b) no transmembrane domains; (c) no GPI-anchor site; and (d) no localization signal predicted to target the protein to mitochondria or other intracellular organelles. ORFs fulfilling these four criteria gained inclusion in the set of soluble secreted proteins we have defined as the computational secretome.
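The four inclusion criteria can be read as a single boolean test per ORF. The following is a minimal sketch of that reading, not the authors' implementation; the `OrfPrediction` record and its field names are hypothetical and simply stand in for the outputs of the four prediction steps.

```python
from dataclasses import dataclass


@dataclass
class OrfPrediction:
    """Hypothetical per-ORF summary of the four predictions (names are assumptions)."""
    name: str
    has_signal_peptide: bool        # criterion (a)
    has_transmembrane_domain: bool  # criterion (b)
    has_gpi_anchor: bool            # criterion (c)
    has_organelle_signal: bool      # criterion (d): mitochondrial or other intracellular signal


def is_soluble_secreted(orf: OrfPrediction) -> bool:
    """An ORF is included only if all four criteria are fulfilled."""
    return (
        orf.has_signal_peptide
        and not orf.has_transmembrane_domain
        and not orf.has_gpi_anchor
        and not orf.has_organelle_signal
    )


def computational_secretome(orfs):
    """Return the names of ORFs that satisfy all four criteria, in input order."""
    return [orf.name for orf in orfs if is_soluble_secreted(orf)]
```

Because the criteria are combined with a logical AND, a single disqualifying prediction (for example a predicted GPI-anchor site) is enough to exclude an ORF.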

Data sets

In order to test our SignalP criteria, we assembled a validation set consisting of 47 C. albicans proteins that are known to be secreted (or members of known families of secreted proteins) and 47 that are known not to be secreted (see Table 1 and supplementary data). Next, we retrieved the entire set of non-redundant open reading frames (ORFs) of the C. albicans genome from CandidaDB and divided it into three manageable partial databases. Sequence data from CandidaDB were obtained from the Stanford Genome Technology Center website. This sequencing of C. albicans was accomplished with the support of the NIDR and the Burroughs Wellcome Fund.
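Dividing an ORF set into partial databases can be sketched as below. This is only an illustration, assuming the ORFs are available as FASTA-formatted text; the function names are hypothetical, and the source does not say how the three partial databases were actually produced.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_into_parts(records, n_parts=3):
    """Split records into n_parts contiguous parts whose sizes differ by at most one."""
    size, extra = divmod(len(records), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        parts.append(records[start:end])
        start = end
    return parts
```

Under this even-split assumption, a set of 6165 ORFs would yield three parts of 2055 records each.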

Table 1. Candida albicans known proteins used as validation set
A. Secretory proteins

| Gene | Accession No. | Description |
|---|---|---|
| ALS1.5eoc | CA0909 | Agglutinin-like protein, 5′-end |
| ALS10 | CA0448 | Agglutinin-like protein |
| ALS11.5f | CA1425 | Agglutinin-like protein, 5′-end |
| ALS2.5f | CA1473 | Agglutinin-like protein, 5′-end |
| ALS3.5eoc | CA0591 | Agglutinin-like protein, 5′-end |
| ALS4.5f | CA1527 | Agglutinin-like protein, 5′-end |
| ALS5 | CA2852 | Agglutinin-like protein |
| ALS6 | CA5713 | Agglutinin-like protein |
| ALS7 | CA5699 | Agglutinin-like protein |
| ALS9.5eoc | CA0315 | Agglutinin-like protein, 5′-end |
| CFL1 | CA3460 | Ferric reductase |
| CHT1 | CA5859 | Endochitinase 1 precursor |
| CHT2 | CA1051 | Chitinase 2 precursor |
| CHT3 | CA5987 | Chitinase 3 precursor |
| EXG1 | CA0822 | Glucan 1,3-β-glucosidase |
| HWP1 | CA2825 | Hyphal wall protein |
| HYR1 | CA1576 | Hyphally regulated protein |
| KRE9 | CA2958 | Cell wall synthesis protein |
| LIP1 | CA1079 | Secretory lipase |
| LIP10 | CA4757 | Secretory lipase |
| LIP2 | CA3068 | Secretory lipase |
| LIP3 | CA4731 | Secretory lipase |
| LIP4 | CA3182 | Secretory lipase |
| LIP5 | CA4417 | Secretory lipase |
| LIP6 | CA4756 | Secretory lipase |
| LIP7 | CA5556 | Secretory lipase |
| LIP8 | CA1241 | Secretory lipase |
| LIP9.exon1 | CA4423 | Secretory lipase 9, exon 1 |
| LIP9.exon2 | CA4422 | Secretory lipase 9, exon 2 |
| PHR1 | CA4857 | GPI-anchored pH-responsive glycosyl transferase |
| PHR2 | CA3867 | pH-Regulated protein 2 |
| PLB1 | CA1975 | Phospholipase B |
| PLB2 | CA0825 | Phospholipase B |
| PLB3 | CA3834 | Phospholipase B (by homology) |
| PLB4.5f | CA0185 | Phospholipase, 5′-end (by homology) |
| PLB5 | CA2223 | Putative phospholipase B precursor |
| SAP1 | CA2660 | Secreted aspartyl proteinase |
| SAP2 | CA3138 | Aspartic protease |
| SAP3 | CA6065 | Secreted aspartyl proteinase |
| SAP4 | CA2055 | Secreted aspartyl proteinase |
| SAP5 | CA2499 | Secreted aspartyl proteinase 5 |
| SAP6 | CA0968 | Secreted aspartyl protease |
| SAP7 | CA1929 | Secreted aspartyl proteinase 7 |
| SAP8 | CA1266 | Aspartic protease |
| SAP9 | CA4700 | Aspartyl proteinase 9 (by homology) |

B. Non-secretory proteins

| Gene | Accession No. | Description |
|---|---|---|
| AAF1 | CA5726 | Adhesion and aggregation-mediating surface antigen |
| ADE2 | CA6139 | Phosphoribosylaminoimidazole carboxylase |
| ARD1 | CA6015 | Protein N-acetyltransferase subunit |
| ARG4 | CA4292 | Argininosuccinate lyase |
| ARO4 | CA1484 | 3-Dehydro-deoxyphosphoheptonate aldolase, tyrosine-inhibited |
| CAP1 | CA0183 | Transcriptional activator |
| CBF1 | CA2473 | Putative centromere binding factor 1 |
| CBK1 | CA2022 | Serine/threonine protein kinase |
| CDC10 | CA4259 | Cell division control protein |
| CDC25 | CA4698 | Cell division cycle protein |
| CDC3 | CA0844 | Cell division control protein |
| CLA4 | CA1710 | Protein kinase homologue |
| CLA4 | CA1710 | Protein kinase homologue |
| CPH1 | CA0154 | Transcription factor |
| CPP1 | CA4721 | Probable protein-tyrosine phosphatase |
| EFG1 | CA2787 | Enhanced filamentous growth factor |
| FAB1 | CA2179 | Phosphatidylinositol 3-phosphate 5-kinase |
| FAS2.5f | CA6105 | Fatty-acyl-CoA synthase, α-chain, 5′-end |
| GSP1 | CA2675 | GTP-binding protein |
| HEM3 | CA0306 | Porphobilinogen deaminase |
| HIS1 | CA4792 | ATP phosphoribosyltransferase |
| HK1 | CA4676 | Histidine kinase |
| HOG1 | CA4677 | Ser/thr protein kinase of MAPK family |
| IMH3.exon1 | CA1246 | IMP dehydrogenase, exon 1 |
| LEU2 | CA5618 | Isopropyl malate dehydrogenase |
| MET3 | CA5238 | ATP sulphurylase |
| MIG1 | CA1593 | Transcriptional regulator |
| MIG1 | CA1593 | Transcriptional regulator |
| MKC1 | CA5865 | Ser/thr protein kinase of MAP kinase family |
| NAG1 | CA1130 | Glucosamine-6-phosphate deaminase |
| NRG1 | CA5289 | Similar to transcriptional repressor Nrg1p/Nrg2p |
| PMI40 | CA0988 | Mannose-6-phosphate isomerase (phosphomannose isomerase) (PMI) (phosphohexomutase) |
| RHO1 | CA2866 | GTP-binding protein of the rho subfamily of ras-like proteins (by homology) |
| SEC18.5f | CA5270 | Vesicular fusion protein by homology, 5′-end |
| SEC4 | CA2681 | GTP-binding protein |
| SNF1 | CA3361 | Serine/threonine protein kinase |
| SSK1 | CA5233 | Putative response regulator two-component phosphorelay gene |
| TPS1 | CA4084 | Trehalose-6-phosphate synthase |
| TUP1 | CA3852 | General transcription repressor |
| URA3 | CA2801 | Orotidine-5-monophosphate decarboxylase (Candida albicans) |
| VPS34 | CA0149 | 1-Phosphatidylinositol 3-kinase |
| YPT1 | CA5077 | GTP-binding protein of the rab family (by homology) |
| YRB1 | CA5822 | GTPase-activating protein (by homology) |

Prediction algorithms

We then queried the validation set and the entire C. albicans ORF set with SignalP v2.0 to identify N-terminal signal peptides. We defined a positive SignalP hit as the simultaneous presence of three criteria: (a) signal peptide predicted by SignalP-NN; (b) signal peptide predicted by SignalP-HMM; and (c) signal peptide cleavage site located within 10–40 aa from the N-terminus.
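The three-part definition of a positive SignalP hit can be sketched as a small predicate. This is an illustration under stated assumptions: the function and parameter names are hypothetical, the cleavage position is taken as a residue count from the N-terminus, and the 10–40 aa window is treated as inclusive at both ends, which the text does not specify.

```python
def is_positive_signalp_hit(nn_predicts_sp, hmm_predicts_sp, cleavage_position,
                            min_position=10, max_position=40):
    """Return True only when all three criteria hold at the same time.

    nn_predicts_sp    -- SignalP-NN predicts a signal peptide (criterion a)
    hmm_predicts_sp   -- SignalP-HMM predicts a signal peptide (criterion b)
    cleavage_position -- predicted cleavage site in aa from the N-terminus,
                         or None if no site was predicted (criterion c)
    """
    if cleavage_position is None:
        return False
    # Inclusive bounds are an assumption made for this sketch.
    in_window = min_position <= cleavage_position <= max_position
    return bool(nn_predicts_sp and hmm_predicts_sp and in_window)
```

An ORF predicted positive by both models but with a cleavage site outside the window fails the test, which matches the note under Table 3 that such ORFs were classified as indeterminate.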

Next, we analysed the set of ORFs predicted to encode proteins with N-terminal signal peptides with the following prediction algorithms to determine whether three additional characteristics were present (Table 2). TMHMM was used to predict transmembrane domains (Krogh et al., 2001), big-PI Predictor was used to identify potential GPI-anchor sites (Eisenhaber et al., 1999, 2001), and TargetP v1.01 was used to identify mitochondrial localization sequences (Emanuelsson et al., 2000). Because some ORFs in CandidaDB are partial, in the case of ORFs containing only the 5′ end of a gene, the corresponding 3′ end of the gene was retrieved from CandidaDB when available and used to query big-PI Predictor for the GPI-anchor analysis. The final dataset comprises all the ORFs whose deduced proteins are potentially soluble secreted proteins in C. albicans according to these four major characteristics.

Table 2. Summary of prediction algorithms used
| Algorithm | Prediction | Validation set | Accuracy (%) | Comments | Reference |
|---|---|---|---|---|---|
| SignalP v2.0 | N-terminal signal peptide | SWISS-PROT version 29 | 97 | Accuracy reported is for eukaryotic data set | Nielsen et al., 1997 |
| TMHMM v2.0 | Transmembrane domains | Set of 160 experimentally known transmembrane proteins and 645 soluble proteins | 97–98 | Accuracy reported refers to individual transmembrane helices. Accuracy is 77.5% for correct topology of protein | Krogh et al., 2001 |
| | | Set of 188 experimentally known transmembrane proteins and 634 soluble proteins | 68 or greater | Independent evaluation of 16 different algorithms to predict transmembrane domains. TMHMM was the best performing program in this evaluation | Moller et al., 2001 |
| big-PI Predictor | GPI-anchor site | Set of 177 proteins from SWISS-PROT and SWISS-NEW | >80 | | Eisenhaber et al., 1999, 2001 |
| TargetP v1.01 | Mitochondrial or other localization sequence | Set of 2738 mitochondrial and 1652 other proteins from SWISS-PROT | 90 | Accuracy reported is for non-plant sequences | Emanuelsson et al., 2000 |

Accuracy is defined as concordance of computational algorithm with experimentally-derived data.

Properties of the computational secretome

As a supplementary analysis, we compared subcellular localization data of S. cerevisiae homologues from the Yeast Protein Localization server, which integrates data derived from genome-wide experimental and predicted subcellular localization studies (Drawid and Gerstein, 2000; Kumar et al., 2002; Drawid et al., 2000; Alexandrov and Gerstein, 2001). Annotation information directly from CandidaDB was used to identify C. albicans and S. cerevisiae homologues for comparison, and no additional criteria were imposed on these assignments to define homology.

Statistical analysis

We used a discriminant analysis (Kleinbaum et al., 1998) based on Mean S and HMM scores from SignalP to analyse the validation set and derive a discriminant function. This discriminant function was applied to the validation set and then to the SignalP predictions of the entire set of C. albicans ORFs and used to re-assign classifications to secretory and non-secretory categories.


When the 47 known secretory proteins were analysed with SignalP, the S scores were all >0.6 and the HMM scores were all >0.8. In contrast, the 47 non-secretory C. albicans proteins all had S scores <0.25 and HMM scores <0.1 (Figure 1A). The standard criteria provided by SignalP correctly predicted that all 47 secreted proteins had N-terminal signal peptides (SP+) and that all 47 non-secreted proteins did not (SP−). In order to generate criteria for predicting the presence or absence of N-terminal signal peptides specifically in C. albicans, we used a statistical discriminant analysis based on Mean S and HMM scores from SignalP to derive prediction parameters for the unknowns. The derived discriminant function based on the validation set was: L = −918.235 − 123.455 × (Mean S score) + 1983.44 × (HMM score), where L values <0 predicted classification to the non-secretory group, and L values >0 predicted classification to the secretory group (an L value of 0 is indeterminate). When the discriminant function was applied to the 94 proteins in the validation set, none required re-classification.
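Applying the reported discriminant function is a one-line calculation followed by a sign test. The sketch below uses the coefficients exactly as printed in the text; the function names are hypothetical, and the code illustrates only the classification rule, not how the coefficients were derived.

```python
def discriminant_score(mean_s, hmm):
    """L = -918.235 - 123.455 * (Mean S score) + 1983.44 * (HMM score)."""
    return -918.235 - 123.455 * mean_s + 1983.44 * hmm


def classify(mean_s, hmm):
    """Map the sign of L to the group named in the text."""
    score = discriminant_score(mean_s, hmm)
    if score > 0:
        return "secretory"
    if score < 0:
        return "non-secretory"
    return "indeterminate"
```

Setting L = 0 and solving for the HMM score shows that the decision boundary lies at an HMM score of about 0.46 when the Mean S score is 0 and about 0.53 when it is 1, so with these coefficients the classification is driven almost entirely by the HMM score.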

Figure 1.

(A) Distribution of SignalP v2.0 scores for (i) 47 known and annotated C. albicans secreted and 47 non-secreted proteins and (ii) 6165 ORFs identified from CandidaDB. Raw Mean S and HMM scores were plotted for ORFs encoding proteins in the validation set of known secretory and non-secretory C. albicans proteins, and then for the entire set of 6165 ORFs from CandidaDB. Unmodified SignalP predictions are represented as follows: solid circle, presence of a signal peptide; solid triangle, absence of a signal peptide. (B) Frequency plot of secretory and non-secretory proteins in C. albicans. Mean S and HMM scores for the entire set of C. albicans ORFs from CandidaDB are shown. The calculated discriminant function generated from the validation set scores is shown as a solid line on the X–Y axis.

When all 6165 ORFs from CandidaDB were analysed using SignalP v2.0, 83.8% of deduced proteins either had an S score >0.7 and HMM score >0.8 or an S score <0.25 and HMM score <0.4, and the remaining ORFs had intermediate mean S and HMM scores, thus separating most ORFs into a clear bimodal distribution (Figure 1B). Using our three standard SignalP criteria (SP+ by mean S score, SP+ by HMM score, signal peptide cleavage site within 10–40 aa of N-terminus), we predicted that 495 of the 6165 ORFs encoded proteins with N-terminal signal peptides. When our C. albicans-derived discriminant function was applied to all 6165 ORFs, the classifications were nearly identical except for three of 495 predicted secretory and five of 5607 predicted non-secretory proteins. Because our approach is intended to be inclusive rather than exclusive, we re-assigned only the five ORFs identified as ‘non-secretory’ by SignalP to the secretory group and analysed these separately (Table 3).
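The comparison between the standard SignalP criteria and the discriminant function amounts to a two-way cross-tabulation of class calls. The sketch below rebuilds that tabulation from the counts reported in the text and in Table 3; the function names and the "S"/"N" labels are hypothetical, and the per-ORF lists are synthetic stand-ins constructed only to reproduce those counts.

```python
from collections import Counter


def cross_tabulate(signalp_calls, discriminant_calls):
    """Count ORFs for every (SignalP call, discriminant call) pair."""
    if len(signalp_calls) != len(discriminant_calls):
        raise ValueError("both call lists must cover the same ORFs")
    return Counter(zip(signalp_calls, discriminant_calls))


def concordance(table):
    """Fraction of ORFs assigned to the same class by both methods."""
    total = sum(table.values())
    agree = sum(n for (a, b), n in table.items() if a == b)
    return agree / total


# Synthetic call lists matching the reported counts:
# 495 secretory ("S") and 5607 non-secretory ("N") by SignalP,
# with 3 and 5 ORFs re-classified by the discriminant function.
signalp = ["S"] * 495 + ["N"] * 5607
discriminant = ["S"] * 492 + ["N"] * 3 + ["S"] * 5 + ["N"] * 5602
table = cross_tabulate(signalp, discriminant)
```

With these counts the two methods disagree on 8 of the 6102 analysable ORFs.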

Table 3. (A) Discriminant analysis of secretory and non-secretory proteins. After generating a discriminant function based on data from the validation set, the SignalP scores for the set of C. albicans ORFs from CandidaDB were analysed. The majority of ORFs had concordant predictions using the two methods. The discriminant analysis re-classified five non-secretory predictions to secretory, and three secretory predictions to non-secretory. (B) List of mis-matches between SignalP prediction and discriminant analysis.
(A)

| SignalP analysis | Discriminant analysis: secretory | Discriminant analysis: non-secretory | Total |
|---|---|---|---|
| Secretory | 492 (8.06%) | 3 (0.05%) | 495 (8.20%) |
| Non-secretory | 5 (0.08%) | 5602 (91.81%) | 5607 (91.80%) |
| Total | 497 (8.14%) | 5605 (91.86%) | 6102* |

(B)

| Gene | Accession No. | Mean S score | HMM score | L score | Transmembrane domains | GPI | Mitochondrial SS | Function |
|---|---|---|---|---|---|---|---|---|
| Group prior = secretory by SignalP | | | | | | | | |
| IPF11508 | CA3023 | 0.572 | 0.469 | −32.2095 | 3 | N | N | Unknown; similarity to Sc integral membrane proteins Rta1p and Rtm1p |
| IPF6880 | CA2185 | 0.473 | 0.443 | −55.2707 | 4 | N | N | Unknown; no significant homology to S. cerevisiae |
| IPF8760 | CA4221 | 0.722 | 0.459 | −48.6262 | 1/SP** | N | N | Unknown; no significant homology to S. cerevisiae |
| Group prior = non-secretory by SignalP | | | | | | | | |
| URA7 | CA1635 | 0.308 | 0.499 | 9.5769 | 1 | N | N | CTP synthase 1 (by homology); Sc homologue is a cytosolic protein |
| VMA5 | CA0711 | 0.189 | 0.500 | 15.3918 | 0 | N | N | H+-ATPase V1 domain 42 kDa subunit (by homology); Sc homologue is a vacuolar membrane protein |

* ORFs predicted to have N-terminal signal peptides by SignalP v2.0 but that did not fulfil our three standard criteria were classified as indeterminate and excluded from this analysis. Thus, percentages shown are based on 6102 analysable ORFs.

** Probably represents a signal peptide, not a true transmembrane domain.

When the 495 deduced proteins predicted to have N-terminal signal peptides were analysed with TMHMM, 103 were predicted to have two or more transmembrane domains, 97 were predicted to have one transmembrane domain, and 295 were predicted to have no transmembrane domains. Of the 97 deduced proteins predicted to have one transmembrane domain, the transmembrane domain was located within the first 40 N-terminal amino acids in 55. Because TMHMM may not distinguish signal peptides from transmembrane domains, the 295 deduced proteins with no transmembrane domains and the 55 deduced proteins with a single transmembrane domain within 40 aa of the N-terminus were considered to be 350 potential soluble secreted proteins (Figure 2).
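The transmembrane rule used here (no predicted helix, or a single helix confined to the N-terminal 40 residues) can be sketched as follows. The function name and the input format are assumptions: predicted helices are given as hypothetical 1-based `(start, end)` residue pairs rather than parsed from real TMHMM output, and "within the first 40 N-terminal amino acids" is interpreted as the helix ending at or before residue 40.

```python
def passes_tm_filter(tm_helices, n_terminal_window=40):
    """Decide whether an ORF stays in the candidate soluble secreted set.

    tm_helices -- list of (start, end) residue positions of predicted
                  transmembrane helices (1-based; format is an assumption)
    """
    if not tm_helices:
        # No predicted transmembrane domain: keep.
        return True
    if len(tm_helices) == 1:
        # A lone helix inside the N-terminal window may be the signal peptide itself.
        _start, end = tm_helices[0]
        return end <= n_terminal_window
    # Two or more predicted helices: treat as a membrane protein and exclude.
    return False
```

Applied to the counts in the text, this rule keeps the 295 ORFs with no predicted helix plus the 55 with a single N-terminal helix, which gives the 350 candidates.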

Figure 2.

Flowchart of strategy used to identify C. albicans soluble secreted proteins using a series of prediction algorithms. A positive SignalP hit was defined as the simultaneous presence of three criteria: (1) signal peptide predicted by SignalP-NN; (2) signal peptide predicted by SignalP-HMM; and (3) signal peptide cleavage site located within 10–40 aa from the N-terminus. *Because TMHMM may not distinguish signal peptides from transmembrane domains, 295 deduced proteins with no transmembrane domains and 55 deduced proteins with a single transmembrane domain within 10–40 aa of the N-terminus were considered to be 350 potential soluble secreted proteins. Of 58 ORFs predicted to encode GPI-anchored proteins in the set of 495 SP+ ORFs, 50 remained after the analysis with TMHMM. After eliminating ORFs predicted to encode proteins with mitochondrial signal sequences, 283 ORFs were predicted to be the set of ORFs encoding soluble secreted proteins.

Next, to identify GPI-anchored proteins which might not be extracellularly secreted, the database of 495 SP+ ORFs was queried with big-PI Predictor. Because the ALS1, ALS3, ALS4 and ALS5 ORFs consist of 5′ fragments in CandidaDB, the corresponding 3′ fragments were retrieved and used for this analysis. After excluding SP+ ORFs encoding proteins with more than one transmembrane domain, this algorithm identified a total of 58 potential GPI-anchored proteins. In the database of 350 SP+ ORFs used for further analysis to predict the secretome, there were 50 predicted GPI-anchored proteins (Table 4).

Table 4. GPI-anchor predictions. A total of 58 ORFs are predicted to encode GPI-anchored proteins from the 495 SP+ dataset; 35 ORFs are unnamed; 29 ORFs are of unknown function by homology. Analysis of the ALS family of genes is preliminary, due to partial and incomplete ORFs in CandidaDB
| Gene name | Accession No. | Gene length | Protein length | Prediction | HMM score | Mean S score | Predicted TM | Description | Subcellular localization of Sc homologue |
|---|---|---|---|---|---|---|---|---|---|
| ALS1.5eoc | CA0909 | 1974 | 658 | Signal peptide | 0.997 | 0.940 | 0 | Agglutinin-like protein, 5′-end | ER |
| ALS10 | CA0448 | 4761 | 1586 | Signal peptide | 1.000 | 0.956 | 0 | Agglutinin-like protein | ER |
| ALS11.5f | CA1425 | 2859 | 952 | Signal peptide | 1.000 | 0.960 | 0 | Agglutinin-like protein, 5′-end | ER |
| ALS3.5eoc | CA0591 | 2658 | 886 | Signal peptide | 0.980 | 0.912 | 0 | Agglutinin-like protein, 5′-end | ER |
| ALS4.5f | CA1527 | 4782 | 1593 | Signal peptide | 1.000 | 0.956 | 0 | Agglutinin-like protein, 5′-end | ER |
| ALS5 | CA2852 | 4044 | 1347 | Signal peptide | 0.995 | 0.919 | 0 | Agglutinin-like protein | ER |
| ALS6 | CA5713 | 4101 | 1366 | Signal peptide | 0.996 | 0.899 | 1/SP | Agglutinin-like protein | ER |
| CRH11 | CA0375 | 1362 | 453 | Signal peptide | 0.999 | 0.786 | 0 | Probable membrane protein | ER |
| CRH12 | CA1835 | 1515 | 504 | Signal peptide | 0.985 | 0.901 | 1 | Cell wall protein | ER |
| CSA1 | CA5585 | 3057 | 1018 | Signal peptide | 0.996 | 0.864 | 0 | Mycelial surface antigen by homology | N/A |
| DFG5 | CA4822 | 1356 | 451 | Signal peptide | 0.994 | 0.924 | 1 | Required for filamentous growth | PM |
| EXG2 | CA4180 | 1440 | 479 | Signal peptide | 0.999 | 0.909 | 0 | Glucan 1,3-β-glucosidase-like by homology | ER |
| HWP1 | CA2825 | 1905 | 635 | Signal peptide | 0.896 | 0.613 | 0 | Hyphal wall protein | ER |
| HYR1 | CA1576 | 2814 | 937 | Signal peptide | 0.995 | 0.937 | 0 | Hyphally-regulated protein | N/A |
| IFF2 | CA2714 | 3750 | 1249 | Signal peptide | 0.964 | 0.958 | 0 | Unknown function | ER |
| IFF4 | CA5819 | 4581 | 1526 | Signal peptide | 0.981 | 0.856 | 0 | Unknown function | ER |
| IFF7 | CA5468 | 3678 | 1225 | Signal peptide | 0.771 | 0.742 | 1 | Unknown function | ER |
| IPF10662 | CA3827 | 1179 | 392 | Signal peptide | 0.999 | 0.873 | 0 | Unknown function | N/A |
| IPF10919 | CA2625 | 660 | 219 | Signal peptide | 0.997 | 0.890 | 0 | Similar to Flo1p (by homology) | ER |
| IPF11998 | CA1898 | 1554 | 517 | Signal peptide | 0.995 | 0.907 | 1 | Unknown function | N/A |
| IPF12022 | CA3622 | 3147 | 1048 | Signal peptide | 0.836 | 0.869 | 0 | Extracellular α-1,4-glucan glucosidase (by homology) | N/A |
| IPF12101 | CA2557 | 660 | 219 | Signal peptide | 0.984 | 0.845 | 0 | Mycelial surface antigen precursor (by homology to Candida gene CSA1) | N/A |
| IPF1218 | CA4835 | 699 | 232 | Signal peptide | 0.981 | 0.819 | 0 | Similar to superoxide dismutase (by homology) | CYT |
| IPF13070 | CA3763 | 891 | 296 | Signal peptide | 1.000 | 0.962 | 1/SP | Unknown function | N/A |
| IPF1341 | CA5112 | 1371 | 456 | Signal peptide | 0.998 | 0.813 | 1 | Similarity to mucin proteins (by homology) | N/A |
| IPF14081 | CA1553 | 924 | 307 | Signal peptide | 0.980 | 0.911 | 1 | Unknown function | N/A |
| IPF14126 | CA1313 | 999 | 332 | Signal peptide | 0.999 | 0.917 | 0 | Unknown function | N/A |
| IPF14598 | CA1360 | 2205 | 734 | Signal peptide | 0.824 | 0.608 | 1 | Unknown function | N/A |
| IPF14706 | CA1777 | 930 | 309 | Signal peptide | 1.000 | 0.961 | 1 | Unknown function | N/A |
| IPF15423 | CA2737 | 951 | 316 | Signal peptide | 0.970 | 0.893 | 0 | Putative superoxide dismutase (by homology) | N/A |
| IPF15442 | CA0188 | 1155 | 384 | Signal peptide | 0.999 | 0.865 | 0 | Unknown function | ER |
| IPF15581 | CA1720 | 420 | 139 | Signal peptide | 0.995 | 0.880 | 0 | Unknown function | N/A |
| IPF1580 | CA5418 | 396 | 131 | Signal peptide | 0.988 | 0.647 | 0 | Unknown function | ER |
| IPF15911 | CA3623 | 3531 | 1176 | Signal peptide | 0.733 | 0.841 | 0 | Unknown function | N/A |
| IPF15957 | CA0171 | 255 | 84 | Signal peptide | 0.966 | 0.663 | 0 | Unknown function | N/A |
| IPF19706 | CA0647 | 723 | 240 | Signal peptide | 0.998 | 0.894 | 0 | Unknown function | N/A |
| IPF20008 | CA4124 | 342 | 113 | Signal peptide | 0.994 | 0.801 | 0 | Unknown function | N/A |
| IPF20103 | CA2502 | 2226 | 741 | Signal peptide | 0.998 | 0.929 | 0 | Unknown function | N/A |
| IPF20148 | CA3826 | 672 | 223 | Signal peptide | 0.999 | 0.845 | 0 | Unknown function | N/A |
| IPF20161 | CA4125 | 642 | 213 | Signal peptide | 0.998 | 0.755 | 0 | Unknown function | N/A |
| IPF20169 | CA4381 | 753 | 250 | Signal peptide | 1.000 | 0.915 | 0 | Unknown function | N/A |
| IPF3233 | CA2475 | 498 | 165 | Signal peptide | 0.999 | 0.939 | 0 | Unknown function | N/A |
| IPF3844 | CA2405 | 2262 | 753 | Signal peptide | 0.952 | 0.658 | 0 | Unknown function | N/A |
| IPF4089 | CA4863 | 1362 | 453 | Signal peptide | 0.781 | 0.882 | 0 | Secretory aspartyl proteinase | ER |
| IPF4123 | CA3642 | 690 | 229 | Signal peptide | 0.989 | 0.952 | 1/SP | Unknown function | N/A |
| IPF4299 | CA4246 | 336 | 111 | Signal peptide | 1.000 | 0.887 | 0 | Unknown function | N/A |
| IPF4722 | CA3252 | 510 | 169 | Signal peptide | 0.993 | 0.807 | 0 | Unknown function | N/A |
| IPF4724 | CA3253 | 816 | 271 | Signal peptide | 0.989 | 0.754 | 0 | Unknown function | N/A |
| IPF5185 | CA1678 | 1602 | 533 | Signal peptide | 1.000 | 0.879 | 0 | Putative cell wall protein (by homology) | ER |
| IPF8129 | CA3630 | 681 | 226 | Signal peptide | 0.984 | 0.702 | 0 | Unknown function | N/A |
| IPF8796 | CA4800 | 1356 | 451 | Signal peptide | 0.965 | 0.934 | 0 | Putative GPI-anchored protein related to Phr1, Phr2 and Phr3 (by homology) | ER |
| IPF9101 | CA2548 | 594 | 197 | Signal peptide | 0.998 | 0.842 | 0 | Unknown function | N/A |
| MID1 | CA0203 | 1680 | 559 | Signal peptide | 0.999 | 0.899 | 0 | Involved in Ca2+ influx during mating (by homology) | ER/PM |
| PLB5 | CA2223 | 2265 | 754 | Signal peptide | 0.965 | 0.693 | 0 | Putative phospholipase B precursor | ER |
| RBT1 | CA2830 | 2145 | 714 | Signal peptide | 0.950 | 0.749 | 0 | Repressed by TUP1 protein 1 | N/A |
| RBT5 | CA2558 | 726 | 241 | Signal peptide | 1.000 | 0.838 | 0 | Repressed by TUP1 protein 5 | ER |
| SAP9 | CA4700 | 1635 | 544 | Signal peptide | 0.999 | 0.935 | 0 | Aspartyl proteinase 9 (by homology) | ER |
| SSR1 | CA5213 | 705 | 234 | Signal peptide | 0.998 | 0.888 | 0 | Secretory stress response protein 1 (by homology) | ER/CW |

Because in eukaryotic cells secretory proteins may be targeted to intracellular organelles rather than secreted extracellularly, we used TargetP to identify mitochondrial targeting sequences in order to eliminate these ORFs from the dataset. In the set of 495 SP+ ORFs, 21 ORFs were excluded due to the presence of a mitochondrial localization signal in 14 ORFs or other localization signal in seven ORFs (Table 5).

Table 5. List of ORFs predicted by TargetP to contain mitochondrial and other intracellular targeting signals
| Gene name | Accession No. | Gene length | Protein length | Prediction | HMM score | Mean S score | Predicted TM | Description | Subcellular localization of Sc homologue | TargetP |
|---|---|---|---|---|---|---|---|---|---|---|
| ADH1 | CA4765 | 1305 | 434 | Signal peptide | 0.983 | 0.740 | 0 | Alcohol dehydrogenase | MIT | MIT |
| COQ3 | CA2432 | 984 | 327 | Signal peptide | 0.716 | 0.533 | 0 | 3,4-Dihydroxy-5-hexaprenylbenzoate methyltransferase | MIT | MIT |
| CPA1 | CA0874 | 1305 | 434 | Signal peptide | 0.987 | 0.488 | 0 | Arginine-specific carbamoylphosphate synthase, small chain | CYT | MIT |
| DLD2 | CA5942 | 1602 | 533 | Signal peptide | 0.678 | 0.786 | 0 | D-Lactate ferricytochrome c oxidoreductase | MIT | MIT |
| FTI1 | CA2642 | 879 | 292 | Signal peptide | 0.865 | 0.620 | 0 | Rad52 inhibitor | MIT | MIT |
| IPF19578 | CA0371 | 2421 | 806 | Signal peptide | 0.992 | 0.882 | 0 | Unknown function | MIT | MIT |
| IPF3361 | CA4785 | 756 | 251 | Signal peptide | 0.905 | 0.506 | 0 | Putative mitochondrial ribosomal protein S7 (by homology) | MIT | MIT |
| IPF7704 | CA4114 | 564 | 187 | Signal peptide | 0.762 | 0.526 | 0 | Unknown function | MIT | MIT |
| IPF8359 | CA3383 | 456 | 151 | Signal peptide | 0.661 | 0.633 | 1 | Unknown function | MIT | MIT |
| IPF864 | CA5347 | 366 | 121 | Signal peptide | 0.917 | 0.667 | 0 | Unknown function | NUC | MIT |
| IPF9370 | CA3964 | 1716 | 571 | Signal peptide | 0.649 | 0.392 | 12 | Unknown function | No homologue | MIT |
| LAT1 | CA4875 | 1434 | 477 | Signal peptide | 0.880 | 0.598 | 0 | Dihydrolipoamide S-acetyltransferase (by homology) | MIT | MIT |
| MGM1 | CA2773 | 2667 | 888 | Signal peptide | 0.872 | 0.560 | 0 | GTPase | MIT | MIT |
| MNT1 | CA3469 | 1296 | 431 | Signal peptide | 0.559 | 0.751 | 1/SP | Mannosyltransferase involved in N-linked and O-linked glycosylation | ER/Golgi | MIT |
| CBP1 | CA5559 | 1470 | 489 | Signal peptide | 0.888 | 0.389 | 0 | Corticosteroid binding protein | NUC | Other |
| COF1 | CA5409 | 435 | 144 | Signal peptide | 0.901 | 0.732 | 0 | Cofilin | CYT | Other |
| IPF149 | CA6127 | 1092 | 363 | Signal peptide | 0.589 | 0.355 | 6 | Peroxisomal membrane protein (by homology) | No homologue | Other |
| IPF19608 | CA0674 | 558 | 185 | Signal peptide | 0.761 | 0.305 | 2 | Unknown function | No homologue | Other |
| IPF8950 | CA2361 | 690 | 229 | Signal peptide | 0.664 | 0.389 | 0 | Unknown function | MIT | Other |
| RPN2 | CA4988 | 2859 | 952 | Signal peptide | 0.651 | 0.436 | 0 | Proteasome regulatory subunit (by homology)? | CYT | Other |
| SOD1.3 | CA4120 | 480 | 159 | Signal peptide | 0.852 | 0.569 | 0 | Cu,Zn-superoxide dismutase, 3′-end | CYT | Other |

Functional information from CandidaDB was reviewed for the 495 SP+ ORFs, and 244 of these ORFs encode deduced proteins of unknown function. After the 495 SP+ ORFs were analysed with TMHMM, big-PI Predictor, and TargetP, 283 remaining ORFs fulfilled our four criteria: (a) presence of an N-terminal signal peptide; (b) lack of a transmembrane domain (unless located at the extreme N-terminus); (c) absence of a GPI-anchor; and (d) no mitochondrial or other localization signal. We propose that these 283 SP+ ORFs comprise the predicted secretome of C. albicans.

Of the 283 SP+ C. albicans ORFs in the predicted secretome, 140 are of unknown function. The remaining 143 have an assigned function by homology to S. cerevisiae proteins (105) or are ORFs that encode known C. albicans proteins or members of known protein families (38). These 38 known C. albicans ORFs encode 25 extracellularly secreted proteins, 10 cell wall-associated proteins, two vacuolar proteins, and one ER-related protein (Table 6).

Table 6. List of known genes in the final predicted Candida albicans secretome
| Gene name | Accession No. | Gene length | Protein length | Prediction | HMM score | Mean S score | Predicted TM | Description | Secretory? |
|---|---|---|---|---|---|---|---|---|---|
| HEX1 | CA4276 | 1689 | 562 | Signal peptide | 0.998 | 0.935 | 0 | N-Acetylglucosaminidase | Y |
| LIP1 | CA1079 | 1407 | 468 | Signal peptide | 0.999 | 0.968 | 0 | Secretory lipase | Y |
| LIP10 | CA4757 | 1398 | 465 | Signal peptide | 0.999 | 0.965 | 0 | Secretory lipase | Y |
| LIP2 | CA3068 | 1401 | 466 | Signal peptide | 0.999 | 0.941 | 0 | Secretory lipase | Y |
| LIP3 | CA4731 | 1416 | 471 | Signal peptide | 1.000 | 0.952 | 0 | Secretory lipase | Y |
| LIP4 | CA3182 | 1380 | 459 | Signal peptide | 0.995 | 0.968 | 0 | Secretory lipase | Y |
| LIP5 | CA4417 | 1392 | 463 | Signal peptide | 0.927 | 0.956 | 0 | Secretory lipase | Y |
| LIP6 | CA4756 | 1392 | 463 | Signal peptide | 0.995 | 0.930 | 0 | Secretory lipase | Y |
| LIP7 | CA5556 | 1281 | 426 | Signal peptide | 0.993 | 0.943 | 0 | Secretory lipase | Y |
| LIP8 | CA1241 | 1383 | 460 | Signal peptide | 0.985 | 0.964 | 0 | Secretory lipase | Y |
| LIP9.exon1 | CA4423 | 642 | 213 | Signal peptide | 0.940 | 0.961 | 0 | Secretory lipase 9, exon 1 | Y |
| LIP9.exon2 | CA4422 | 792 | 263 | Signal peptide | 0.848 | 0.841 | 0 | Secretory lipase 9, exon 2 | Y |
| PLB1 | CA1975 | 1818 | 605 | Signal peptide | 0.987 | 0.940 | 0 | Phospholipase B | Y |
| PLB2 | CA0825 | 1830 | 609 | Signal peptide | 0.998 | 0.962 | 0 | Phospholipase B | Y |
| PLB4.5f | CA0185 | 1185 | 394 | Signal peptide | 1.000 | 0.958 | 0 | Phospholipase, 5′-end (by homology) | Y |
| SAP1 | CA2660 | 1176 | 391 | Signal peptide | 0.999 | 0.921 | 0 | Secreted aspartyl proteinase | Y |
| SAP2 | CA3138 | 1197 | 399 | Signal peptide | 0.999 | 0.935 | 0 | Aspartic protease | Y |
| SAP3 | CA6065 | 1197 | 399 | Signal peptide | 0.999 | 0.926 | 0 | Secreted aspartyl proteinase | Y |
| SAP4 | CA2055 | 1254 | 417 | Signal peptide | 0.997 | 0.926 | 0 | Secreted aspartyl proteinase | Y |
| SAP5 | CA2499 | 1257 | 418 | Signal peptide | 0.998 | 0.925 | 0 | Secreted aspartyl proteinase 5 | Y |
| SAP6 | CA0968 | 1257 | 418 | Signal peptide | 0.998 | 0.925 | 0 | Secreted aspartyl protease | Y |
| SAP7 | CA1929 | 1767 | 588 | Signal peptide | 0.981 | 0.929 | 0 | Secreted aspartyl proteinase 7 | Y |
| SAP8 | CA1266 | 1218 | 405 | Signal peptide | 0.998 | 0.855 | 0 | Aspartic protease | Y |
| RBT4 | CA0104 | 1077 | 358 | Signal peptide | 0.969 | 0.624 | 0 | Repressed by TUP1 protein | Y? |
| RBT7 | CA0169 | 918 | 305 | Signal peptide | 0.999 | 0.914 | 0 | Repressed by TUP1 | Y? |
| Cell wall-associated | | | | | | | | | |
| ALS2.5f | CA1473 | 5271 | 1756 | Signal peptide | 1.000 | 0.956 | 0 | Agglutinin-like protein, 5′-end | CW |
| ALS7 | CA5699 | 6003 | 2000 | Signal peptide | 0.993 | 0.887 | 0 | Agglutinin-like protein | CW |
| ALS9.5eoc | CA0315 | 2685 | 894 | Signal peptide | 0.998 | 0.934 | 0 | Agglutinin-like protein, 5′-end | CW |
| BGL21 | CA1541 | 927 | 308 | Signal peptide | 0.986 | 0.913 | 0 | Endo-β-1,3-glucanase | CW |
| CHT1 | CA5859 | 1389 | 462 | Signal peptide | 0.997 | 0.957 | 1/SP | Endochitinase 1 precursor | CW |
| CHT2 | CA1051 | 1752 | 583 | Signal peptide | 1.000 | 0.857 | 0 | Chitinase 2 precursor | CW |
| CHT3 | CA5987 | 1704 | 567 | Signal peptide | 0.999 | 0.959 | 0 | Chitinase 3 precursor | CW |
| KRE9 | CA2958 | 816 | 271 | Signal peptide | 0.997 | 0.927 | 0 | Cell wall synthesis protein | CW |
| PHR1 | CA4857 | 1647 | 548 | Signal peptide | 0.966 | 0.893 | 0 | GPI-anchored pH-responsive glycosyl transferase | CW |
| PRA1 | CA4399 | 900 | 299 | Signal peptide | 1.000 | 0.965 | 0 | pH-Regulated antigen | CW? |
| APR1 | CA4476 | 1260 | 419 | Signal peptide | 0.998 | 0.810 | 0 | Aspartyl protease | VAC |
| CPY1.5f | CA2123 | 258 | 85 | Signal peptide | 0.998 | 0.815 | 0 | Carboxypeptidase Y precursor, 5′-end | VAC |
| CYP51 | CA5717 | 639 | 212 | Signal peptide | 0.932 | 0.891 | 1/SP | Cyclophilin-peptidylprolyl cis-trans isomerase or PPIase | ER |

Comparison of these 283 SP+ C. albicans ORFs to S. cerevisiae subcellular localization data identified 73 S. cerevisiae homologues that are also secretory pathway proteins, 24 membrane proteins, 22 mitochondrial proteins, seven vacuolar proteins, and 50 homologues with other subcellular localizations. No S. cerevisiae homologue was identified by CandidaDB for 124 ORFs (see supplementary data).


Soluble secreted C. albicans virulence factors, such as the secreted aspartyl proteases (reviewed in Hoegl et al., 1996; Hube et al., 1997; Sanglard et al., 1997) and extracellular phospholipases (reviewed in Ghannoum, 2000; Niewerth and Korting, 2001), have been studied in detail, and many of these are found either on the cell surface or in the extracellular environment. Members of the secreted aspartyl protease (Sap) family of proteins are differentially secreted extracellularly depending on strain and environmental conditions (White and Agabian, 1995). C. albicans sap1, sap2 and sap3 mutants, and a triple sap4, sap5 and sap6 null mutant are attenuated in virulence in a mouse model of invasive candidiasis (Hube et al., 1997; Sanglard et al., 1997). In addition to the signal peptide, the Sap propeptide is also important for proper secretion (Monod et al., 2000). Extracellular phospholipases have also been implicated as virulence factors involved in the pathogenesis of infection with C. albicans (Leidich et al., 1998; Mukherjee et al., 2001). The deduced protein of C. albicans PLB1, a phospholipase B, is predicted to have a stretch of hydrophobic amino acids at the amino terminus that likely serves as a signal peptide. The family of C. albicans secretory lipases may also have a role in virulence (Fu et al., 1997; Hube et al., 2000). In addition, a number of secreted proteins that remain associated with the cell wall or membrane have been identified and shown to have a role in virulence, including the outer mannoprotein Hwp1 (Staab et al., 1999), the ALS family of genes (reviewed in Hoyer, 2001) and the pH-responsive genes PHR1-2 (Bernardis et al., 1998; Ghannoum et al., 1995; Fonzi, 1999; Saporito-Irwin et al., 1995). Thus, it is apparent that the ability of C. albicans to transport proteins to the cell surface via the secretion pathway and to secrete degradative enzymes out of the cell is required for virulence and pathogenesis (reviewed in Haynes, 2001).

Although it is clear that detailed studies of individual genes and gene products are essential, it is also important to obtain a more global perspective on secreted proteins, including those involved in virulence. The use of computer-based prediction algorithms is a powerful, systematic, and rapid way to obtain preliminary functional information on the gene products of an entire genome. The information can then be analysed globally, to organize predicted proteins into functional groupings, or individually, to identify genes of particular interest for future experimental study.

Since one of our interests is secreted proteins associated with virulence, we queried the C. albicans genome database in an effort to identify all genes whose deduced proteins would likely be soluble secreted proteins in order to: (a) obtain a global perspective on secreted proteins in C. albicans; and (b) identify previously uncharacterized genes for further experimental study. We therefore used a series of prediction algorithms available on Internet-based servers to analyse the C. albicans genome database. First, we assembled a validation set of known C. albicans secretory and non-secretory proteins to train our prediction algorithm. We generated a discriminant function which was applied to the unknown ORFs to derive a new cut-off whereby re-assignments could be made. Then we used our criteria based on the SignalP v2.0 algorithm to identify 495 ORFs with N-terminal signal peptides from a total of 6165 C. albicans ORFs. Using the discriminant function, we re-classified as secretory two ORFs that SignalP had predicted to be non-secretory. Thus, approximately 8% of the entire C. albicans genome consists of SP+ ORFs. In comparison, approximately 11% of S. cerevisiae ORFs were predicted to encode signal peptides, although a different prediction algorithm was used (Caro et al., 1997). Next, we used TMHMM to identify ORFs predicted to have no true transmembrane domains. In this subset, we identified 350 ORFs that fulfilled our criteria. Proteins with one or more transmembrane domains were eliminated as they were unlikely to be secreted extracellularly. However, because TMHMM does not necessarily distinguish signal peptides from transmembrane domains, we did not exclude ORFs for which TMHMM predicted a transmembrane domain at the N-terminus. We then identified 50 potential GPI-anchored proteins from this dataset (58 total from the SP+ TM 0–1 dataset, or 50 total from the SP+ TM 0 dataset). This is on the same order as the 51 GPI-anchored proteins predicted in S. cerevisiae using a similar analysis (Caro et al., 1997). Finally, we used TargetP to identify mitochondrial signal sequences to eliminate secretory proteins that are targeted to intracellular organelles, yielding a computationally-defined secretome of 283 ORFs.
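The successive filtering steps can be followed as simple arithmetic on the counts reported in this study. The sketch below only restates those counts; the variable names are illustrative, and the number removed at the TargetP step is obtained here by subtraction rather than quoted from the text.

```python
# Counts reported in the text.
total_orfs = 6165     # C. albicans ORFs analysed with SignalP
sp_positive = 495     # ORFs with a predicted N-terminal signal peptide
no_true_tm = 350      # SP+ ORFs with no true transmembrane domain (TMHMM)
gpi_predicted = 50    # predicted GPI-anchored proteins in the SP+ TM 0 dataset
secretome = 283       # final predicted secretome after TargetP

# Derived quantities.
sp_fraction = sp_positive / total_orfs        # share of ORFs that are SP+
not_gpi = no_true_tm - gpi_predicted          # ORFs left after the GPI-anchor filter
removed_by_targetp = not_gpi - secretome      # derived by subtraction (not stated in the text)

print(f"SP+ fraction of all ORFs: {sp_fraction:.1%}")
print(f"Remaining after GPI filter: {not_gpi}")
print(f"Removed at the TargetP step: {removed_by_targetp}")
```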

Given the inherent limitations of the prediction algorithms, a minority of ORFs are probably assigned incorrectly. Our three SignalP criteria clearly separated the ORFs from the C. albicans genome into two distinct categories, although a small number of ORFs fell into an intermediate range. However, by using a discriminant analysis, we derived a function from the validation sets that provides a new cut-off for assigning ORFs to secretory and non-secretory classifications. Thus, the vast majority of these SP+ ORFs are most likely proteins that enter the general secretory pathway, and are either secreted extracellularly, GPI-anchored, or in some cases targeted to distinct intracellular organelles. Overall, we predicted that the potential C. albicans secretome, according to our set of four prediction algorithms, consists of up to an estimated 283 proteins.
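As an illustration of how a discriminant function derived from validation sets can yield a cut-off, the sketch below fits a two-class Fisher linear discriminant on two SignalP scores. It rests on several assumptions: the text does not specify the form of the discriminant function or which scores entered it, so the choice of Fisher's method, of the HMM score and mean S score as features, and of the midpoint cut-off are all illustrative. The secreted examples reuse score pairs from Table 6, whereas the non-secreted score pairs are placeholders invented for the example.

```python
def mean2(rows):
    """Mean of a list of (x, y) pairs."""
    n = len(rows)
    return (sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n)


def scatter2(rows, m):
    """2x2 within-class scatter matrix around the class mean m."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = (r[0] - m[0], r[1] - m[1])
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fisher_discriminant(pos, neg):
    """Return weights w and a midpoint cut-off for the score w . x."""
    m1, m0 = mean2(pos), mean2(neg)
    s1, s0 = scatter2(pos, m1), scatter2(neg, m0)
    sw = [[s1[i][j] + s0[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    if det == 0:
        raise ValueError("pooled scatter matrix is singular")
    diff = (m1[0] - m0[0], m1[1] - m0[1])
    # w = Sw^-1 (m1 - m0), using the closed-form inverse of a 2x2 matrix.
    w = ((sw[1][1] * diff[0] - sw[0][1] * diff[1]) / det,
         (-sw[1][0] * diff[0] + sw[0][0] * diff[1]) / det)
    cutoff = 0.5 * (w[0] * (m1[0] + m0[0]) + w[1] * (m1[1] + m0[1]))
    return w, cutoff


def is_secretory(x, w, cutoff):
    """Classify a (HMM score, mean S score) pair against the cut-off."""
    return w[0] * x[0] + w[1] * x[1] > cutoff


# (HMM score, mean S score) for HEX1, LIP5, PLB1, SAP8 and RBT4 from Table 6.
secreted = [(0.998, 0.935), (0.927, 0.956), (0.987, 0.940), (0.998, 0.855), (0.969, 0.624)]
# Placeholder values for non-secreted proteins (hypothetical, not from the study).
non_secreted = [(0.010, 0.120), (0.050, 0.200), (0.002, 0.080), (0.120, 0.310), (0.030, 0.150)]

w, cutoff = fisher_discriminant(secreted, non_secreted)
print(is_secretory((0.90, 0.90), w, cutoff), is_secretory((0.10, 0.10), w, cutoff))
```

An ORF whose scores fall in the intermediate range is then assigned according to which side of the cut-off its discriminant score lies on.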

In this study, we defined the predicted type II secretome of C. albicans. We identified, as expected, genes whose proteins have signal peptides and are known to be cell wall-associated, including EXG1 (exo-β-1,3-glucanase), BGL2 (β-1,3-glucan transferase), CHT1-3 (chitinases), and HEX1 (β-N-acetylglucosaminidase). We also identified genes whose proteins have signal peptides and are known to be secreted extracellularly, including: SAP1-9 (secreted aspartyl proteases); PLB1 (phospholipase B); LIP1 (secreted lipase); and gene homologues of glucoamylase, carboxypeptidase Y, acid phosphatase and alkaline phosphatase. Interestingly, 160 of these ORFs are unnamed, and 140 of them are ORFs of unknown function.

However, some C. albicans proteins are known to reach the extracellular space independently of the type II secretion pathway. It remains unclear how proteins such as enolase (Mason et al., 1989; Franklyn et al., 1990; Angiolella et al., 1996; Sundstrom and Aliaga, 1994), Hsp70 and Hsp90 (Matthews et al., 1988) reach the cell wall and/or extracellular space. At this point it is not possible to predict such extracellular proteins using bioinformatic approaches. Genes encoding cell wall-associated proteins that were correctly predicted to lack signal peptides in our database included: ENO1 (enolase), SSA1 (Hsp70), PGK (phosphoglycerate kinase) and GAPDH (TDH1). Thus, while the majority of secreted proteins in C. albicans would be expected to be transported via the general secretory pathway (Lee et al., 2001; Mao et al., 1999), there may be several potential non-SEC-dependent pathways in C. albicans that permit proteins to reach the extracellular space. In addition to non-specific mechanisms such as cell lysis or leakage, other possibilities include efflux pumps of the MDR and CDR families (reviewed by White et al., 1998), non-classical transport mediated by NCE1 (Cleves et al., 1996) and perhaps other unknown specific transporters.

In order to gain additional insight into the functional properties of these potential C. albicans secretory proteins in our dataset, we referred to the extensive subcellular localization data available for the corresponding S. cerevisiae homologues. Although no S. cerevisiae homologue was identified by CandidaDB for 124 ORFs, the majority of the evaluable S. cerevisiae homologues were secretory pathway proteins.

We also compared our database of predicted secretory proteins to an experimentally-derived set of C. albicans secreted proteins recently identified in a heterologous, genome-wide genetic screen. In this approach, C. albicans genomic DNA fragments were fused in-frame to episomal vectors bearing mutant suc2 alleles, encoding invertase lacking the signal peptide region in S. cerevisiae, such that growth on sucrose implies the presence of a signal peptide (Monteoliva et al., 2002). This screen identified 68 putatively exported C. albicans proteins. Of 54 ORFs which could be directly retrieved from CandidaDB, our identification of signal peptides using our three SignalP criteria was concordant in 50 cases (see supplementary data).

Our GPI-anchor predictions should be interpreted with caution, as the big-PI Predictor is not intended to be fungal-specific. A recent report predicts C. albicans to encode 54 GPI-anchored proteins (Sundstrom, 2002). Of the 44 of these ORFs available in CandidaDB, our predictions agreed in 29 cases.
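The two external comparisons can be expressed as simple concordance rates. The helper below is a hypothetical illustration, and the percentages it produces are derived here from the counts given in the text rather than quoted from it.

```python
def concordance(agree: int, evaluable: int) -> float:
    """Fraction of evaluable ORFs for which two classifications agree."""
    if evaluable <= 0 or not 0 <= agree <= evaluable:
        raise ValueError("need 0 <= agree <= evaluable and evaluable > 0")
    return agree / evaluable


# Signal peptide calls vs. the invertase-fusion screen: 50 of 54 retrievable ORFs.
signal_peptide_rate = concordance(50, 54)
# GPI-anchor calls vs. the published GPI prediction: 29 of 44 available ORFs.
gpi_rate = concordance(29, 44)

print(f"Signal peptide concordance: {signal_peptide_rate:.1%}")  # 92.6%
print(f"GPI-anchor concordance: {gpi_rate:.1%}")                 # 65.9%
```

The lower agreement for GPI anchors is consistent with the caution above that the big-PI Predictor is not fungal-specific.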

An important limitation of our approach is that it relies on prediction algorithms with a defined error rate, which could potentially be greater in specific organisms. In addition, there are gene fragments in CandidaDB which can potentially confuse the prediction algorithms; thus, results obtained with partial ORFs must be cross-checked against relevant upstream or downstream sequences, if available, and evaluated cautiously. Finally, these prediction algorithms are useful for rapid preliminary analyses of large amounts of genomic data, but it must be emphasized that these are only predictions, which require experimental validation. Our approach was to be inclusive rather than exclusive, so overall these results probably represent an overestimation of the actual C. albicans secretome, especially since many ORFs in the genome database have not been confirmed experimentally and some ORFs may not be expressed. Conversely, we may have inadvertently excluded secreted proteins, e.g. proteins encoded by ORFs not annotated by CandidaDB, particularly small ORFs that would not fulfil gene prediction criteria.

In future studies, we would like to examine the following questions using proteomics-based approaches to analyse C. albicans soluble secreted proteins: (a) can novel secreted proteins be identified, and what is their role in virulence?; (b) are there abundant proteins that are secreted but do not have signal peptides, and if so, how do they reach the extracellular space?; (c) what are the specific targeting signals in C. albicans that allow sorting of proteins to their proper intracellular destinations? Fortunately, the extensive work done in S. cerevisiae will provide a roadmap toward answering some of these questions in this pathogenic yeast.


We thank Margaret Hostetter, Peter Novick and Craig Roy (all from Yale University) for helpful advice, and Birgit Eisenhaber (Research Institute of Molecular Pathology) for assistance with GPI-anchor predictions. We thank the Galar Fungail Consortium for CandidaDB, and the Stanford Genome Technology Center for the Candida albicans genome sequencing project. Sequence data for Candida albicans were obtained from the Stanford Genome Technology Center website. Sequencing of Candida albicans was accomplished with the support of the NIDR and the Burroughs Wellcome Fund. This work was supported by grants from the Department of Veterans' Affairs (Career Development Award to S.L. and Merit Review to B.W.) and the National Institute of Allergy and Infectious Diseases (R01 AI-47442 to B.W.).