Genome-wide identification of fungal GPI proteins

Authors

  • Piet W. J. de Groot,

    Corresponding author
    1. Laboratory for Microbiology, Swammerdam Institute for Life Sciences, University of Amsterdam, Nieuwe Achtergracht 166, 1018 WV Amsterdam, The Netherlands
    • Swammerdam Institute for Life Sciences, University of Amsterdam, Nieuwe Achtergracht 166, 1018 WV Amsterdam, The Netherlands.
  • Klaas J. Hellingwerf,

    1. Laboratory for Microbiology, Swammerdam Institute for Life Sciences, University of Amsterdam, Nieuwe Achtergracht 166, 1018 WV Amsterdam, The Netherlands
  • Frans M. Klis

    1. Laboratory for Microbiology, Swammerdam Institute for Life Sciences, University of Amsterdam, Nieuwe Achtergracht 166, 1018 WV Amsterdam, The Netherlands

Abstract

Glycosylphosphatidylinositol-modified (GPI) proteins share structural features that allow their identification using a genomic approach. From the known S. cerevisiae and C. albicans GPI proteins, the following consensus sequence for the GPI attachment site and its downstream region was derived: [NSGDAC]–[GASVIETKDLF]–[GASV]–X(4,19)–[FILMVAGPSTCYWN](10)>, where > indicates the C-terminal end of the protein. This consensus sequence, which recognized known GPI proteins from various fungi, was used to screen the genomes of the yeasts S. cerevisiae, C. albicans, Sz. pombe and the filamentous fungus N. crassa for putative GPI proteins. The subsets of proteins so obtained were further screened for the presence of an N-terminal signal sequence for secretion and for the absence of internal transmembrane domains. In this way, we identified 66 putative GPI proteins in S. cerevisiae. Some of these are known GPI proteins that were not identified by earlier genomic analyses, indicating that this selection procedure gives a more complete picture of the S. cerevisiae GPI proteome. Using the same approach, 104 putative GPI proteins were identified in the human pathogen C. albicans. Among these were the proteins Gas/Phr, Ecm33, Crh and Plb, all members of GPI protein families that are also present in S. cerevisiae. In addition, several proteins and protein families with no significant homology to S. cerevisiae proteins were identified, including the cell wall-associated Als, Csa1/Rbt5, Hwp1/Rbt1 and Hyr1 protein families. In Sz. pombe, which has a low level of (galacto)mannan in the cell wall compared to C. albicans and S. cerevisiae, only 33 GPI candidates were identified, and in N. crassa 97. BLAST searches revealed that about half of the putative GPI proteins that were identified in Sz. pombe and N. crassa are homologous to known or putative GPI proteins from other fungi. We conclude that our algorithm is selective and can also be used for GPI protein identification in other fungi.
Copyright © 2003 John Wiley & Sons, Ltd.

Introduction

Glycosylphosphatidylinositol-modified (GPI) proteins are widely found in lower and higher eukaryotic organisms (Eisenhaber et al., 2001). In fungi, GPI proteins are known to be either covalently incorporated into the cell wall network or to remain attached to the plasma membrane. Various functions have been suggested for them. They may be involved in cell wall biosynthesis and cell wall remodelling, they may determine surface hydrophobicity and antigenicity, and they are thought to have a role in adhesion and virulence (Hoyer, 2001; Klis et al., 2001; Sundstrom, 2002). The predicted amino acid sequences of GPI proteins conform to a general pattern. At their N-termini, they have a hydrophobic signal sequence that directs the protein to the ER. At their C-termini, GPI proteins have a second hydrophobic domain, which is cleaved off and replaced with a GPI anchor, a preformed lipid in the membrane of the endoplasmic reticulum (Orlean, 1997). Two independent studies aimed at genome-wide identification of the GPI proteins of Saccharomyces cerevisiae have earlier shown that sequence motifs can be successfully used to identify most of the GPI proteins (Caro et al., 1997; Hamada et al., 1998a). Hamada and co-workers used three criteria to select for GPI proteins: (a) the presence of a hydrophobic tail; (b) the presence of an N-terminal signal peptide for secretion; and (c) the presence of a serine/threonine (S/T)-rich region. Although often found in GPI proteins, an S/T-rich region has not been shown to be an absolute prerequisite for GPI proteins to become attached to a GPI anchor, and Caro et al. (1997) therefore did not use this feature as a selection criterion. With the presence of an N-terminal signal peptide for secretion as the first selection criterion, final identification of GPI proteins by Caro et al. (1997) was based on a more specific definition of the GPI-attachment site (ω site) and its downstream sequences in the C-terminal region of GPI proteins as described by Udenfriend and Kodukula (1995) and Nuoffer et al. (1993). In detail, the ω site could be an Asn, Ser, Gly, Ala, Asp or Cys residue followed by two amino acids with relatively short side chains, the ω + 2 position being more critical (Figure 1). The ω region is followed by a less defined spacer region of about 10 amino acids followed by a hydrophobic tail. The length of the C-terminal signal peptide normally varies between 15 and 30 residues (Udenfriend and Kodukula, 1995) but, as was shown in S. cerevisiae for Gas1, signal peptides of 31 residues do occur (Nuoffer et al., 1991). In the set of S. cerevisiae GPI proteins so obtained, Asn or Gly predominated at the GPI attachment site (Caro et al., 1997).

Figure 1.

GPI-specific sequence features. (A) GPI algorithm described by Caro et al. (1997). (B) Modified GPI algorithm, designed to recognize S. cerevisiae and C. albicans GPI proteins more selectively. Residues in square brackets indicate the possible amino acids at a particular position; number(s) in round brackets indicate (the limits of) the length of a sequence element; X denotes any amino acid; and > denotes the C-terminus of the protein. The GPI attachment site (ω) is indicated below the algorithm; the cleavage site is marked with an arrow

The function of the hydrophobic tail is to retain the protein on the membrane until GPI modification. The length and the hydrophobic character of the tail, rather than its specific sequence, therefore seem to be relevant for GPI anchoring (Ikezawa, 2002). Consistent with this, Coyne et al. (1993) showed GPI modification of a truncated protein with a synthetic tail of at least 11 Leu residues, whereas introduction of a single charge in the middle of the hydrophobic tail blocks GPI anchoring (Nuoffer et al., 1991).

What determines the final destination of S. cerevisiae GPI proteins, cell wall or plasma membrane, is not precisely known. The presence of basic residues in the region immediately upstream of the GPI attachment site (ω region) is favoured in proteins that are predominantly localized in the plasma membrane (Vossen et al., 1997). On the other hand, the presence of V, I or L at 4 or 5 amino acids upstream of the ω site (ω-4, ω-5) and Y or N at ω-2 has been shown to act as a positive signal for cell wall localization (Hamada et al., 1998b, 1999). Thus, GPI protein localization seems to be determined at least partly by the amino acids in the ω region.
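The ω-region signals described above can be inspected mechanically once a candidate ω site has been chosen. The Python sketch below is illustrative only: the function name, the five-residue window used for "immediately upstream" and the restriction of "basic" to K and R are assumptions made here, not definitions taken from the cited studies.

```python
def omega_region_features(seq: str, omega: int, window: int = 5) -> dict:
    """Report localization-related residues upstream of a candidate omega site.

    seq    -- protein (or C-terminal) sequence in one-letter code
    omega  -- 0-based index of the candidate GPI attachment site
    window -- hypothetical size of the 'immediately upstream' region
    """
    if omega < 5:
        raise ValueError("need at least five residues upstream of the omega site")
    upstream = seq[max(0, omega - window):omega]
    return {
        # Basic residues upstream: reported for plasma membrane proteins.
        "basic_upstream": sum(upstream.count(aa) for aa in "KR"),
        # V, I or L at omega-4 or omega-5: reported cell wall signal.
        "vil_at_minus_4_or_5": seq[omega - 4] in "VIL" or seq[omega - 5] in "VIL",
        # Y or N at omega-2: reported cell wall signal.
        "yn_at_minus_2": seq[omega - 2] in "YN",
    }


# C-terminal 42 residues of S. cerevisiae Fit1 as listed in Table 2;
# the putative omega site is the G at index 19.
fit1_cterm = "AGIFTNGKSSTTPQIVNYTGAADSIAAGTGLMGAALAAVIFL"
features = omega_region_features(fit1_cterm, omega=19)
```

For this Fit1 fragment the sketch reports both cell wall features and no basic residue in the window, which agrees with its annotation as a cell wall protein, although a single example does not validate the rule.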

In this paper, we aimed to develop an algorithm for the identification of fungal GPI proteins using S. cerevisiae and Candida albicans proteins, for which GPI anchorage is known or likely, as positive controls. We validated the algorithm on the S. cerevisiae genome and tested its predictive power on the human pathogenic yeast C. albicans, the fission yeast Schizosaccharomyces pombe and the filamentous ascomycete Neurospora crassa.

Materials and methods

Sequence retrieval

The non-redundant open reading frames (ORFs) from the C. albicans genome were retrieved from CandidaDB (http://genolist.pasteur.fr/CandidaDB/). This genome database was created by the EU-funded consortium ‘Galar Fungail’ by performing independent annotation of assembly 19 sequence data obtained from the Stanford Genome Technology Center (http://www-sequence.stanford.edu/group/candida). The S. cerevisiae genome sequence was retrieved from the Saccharomyces Genome Database (http://genome-www.stanford.edu/Saccharomyces). For one S. cerevisiae ORF, Ecm33/Ybr078w, we modified the C-terminal region of the peptide sequence because it is now known that this ORF encodes a protein of 429 instead of 468 amino acids (Hamada et al., 1999). Annotated genome sequences of Sz. pombe were obtained from the Sanger Institute (Hinxton, UK) at http://www.sanger.ac.uk/Projects/S_pombe and the annotated N. crassa genome sequence (release 3: 2.12.2002) was downloaded from the Neurospora crassa Genome Database of the Whitehead Institute/MIT Center for Genome Research (http://www-genome.wi.mit.edu).

In silico analysis

Searches for proteins having a potential GPI anchor addition signal at their C-termini, as defined by our algorithm, were performed using the program FUZZPRO from the EMBOSS software package at http://www.hgmp.mrc.ac.uk/Software/EMBOSS/. The selected ORFs were screened further for the presence of a signal sequence for import into the ER using SignalP version 2.0 at http://www.cbs.dtu.dk/services/SignalP-2.0/. SignalP version 2.0 uses two signal peptide prediction methods, one based on Neural Networks (SignalP-NN; Von Heijne, 1986) and one based on Hidden Markov Models (SignalP-HMM; Nielsen et al., 1997). The standardized threshold value for signal peptides both in SignalP-NN (Smean) and SignalP-HMM (Sprob) is 0.5. To reduce the number of false positives we used a cut-off value of 0.6 for both methods. Only proteins that are predicted to have a signal peptide by both methods (SignalP-NN and SignalP-HMM) were selected for further analysis. The presence of integral transmembrane (TM) domains was analysed with TMHMM version 2.0 (http://www.cbs.dtu.dk/services/TMHMM/) and PSORT II (http://psort.nibb.ac.jp). TMHMM predicts TM domains with a Hidden Markov Model based on the presence of hydrophobic regions of at least 18 residues (Krogh et al., 2001) and PSORT II uses the method that was described by Nakai and Kanehisa (1992), using a threshold value of 2.0. Proteins that do not have any internal TM domain according to PSORT II and TMHMM were listed as putative GPI proteins. PSORT II was also used for protein localization predictions. BLASTP searches were performed at NCBI (http://www.ncbi.nlm.nih.gov/BLAST). The results of our in silico analysis are freely accessible at http://www.pasteur.fr/recherche/unites/Galar_Fungail/.
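The successive filters described above can be summarized as a small decision function. The following Python sketch assumes that the outputs of FUZZPRO, SignalP, TMHMM and PSORT II have already been collected for each ORF; the class, field and function names are hypothetical. Two further assumptions are made: a score of exactly 0.6 is treated as passing, and ORFs for which the two TM predictors disagree are flagged as ambiguous rather than rejected, in line with the ambiguous cases that the Results keep in the lists.

```python
from dataclasses import dataclass

SIGNALP_CUTOFF = 0.6  # stricter than the standard 0.5 threshold


@dataclass
class OrfPredictions:
    """Precomputed predictor outputs for one ORF (field names are hypothetical)."""
    has_gpi_motif: bool      # C-terminal pattern hit (FUZZPRO)
    smean: float             # SignalP-NN mean S score
    sprob: float             # SignalP-HMM signal peptide probability
    tmhmm_internal_tm: bool  # TMHMM predicts an internal TM domain
    psort_internal_tm: bool  # PSORT II predicts an internal TM domain


def classify(orf: OrfPredictions) -> str:
    """Return 'putative', 'ambiguous' or 'rejected' for one ORF."""
    if not orf.has_gpi_motif:
        return "rejected"
    # Both SignalP methods must call a signal peptide at the stricter cut-off.
    if orf.smean < SIGNALP_CUTOFF or orf.sprob < SIGNALP_CUTOFF:
        return "rejected"
    if orf.tmhmm_internal_tm and orf.psort_internal_tm:
        return "rejected"
    if orf.tmhmm_internal_tm or orf.psort_internal_tm:
        return "ambiguous"  # predictors disagree; flag rather than decide
    return "putative"
```

Keeping the three outcomes separate makes it easy to list the ambiguous ORFs with a footnote, as is done in the tables, instead of silently including or dropping them.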

Results

Developing a fungal-specific GPI algorithm

In order to develop an algorithm that would selectively identify GPI proteins of the pathogenic yeast C. albicans and other fungi, we analysed the C-terminal regions of known and predicted S. cerevisiae GPI proteins identified by Caro et al. (1997), who used an algorithm (Figure 1A) that was based on sequence requirements for GPI anchoring as described by Coyne et al. (1993). To make the algorithm more selective for GPI proteins we made some adjustments in accordance with Nuoffer et al. (1993) and Udenfriend and Kodukula (1995). In detail, the requirement for N or G at the GPI-attachment site (ω site) was made less stringent, whereas the presence of charged residues in the 11 most C-terminal amino acids, the hydrophobic tail, was not allowed. We also found that none of the GPI proteins for which GPI anchorage is supported by biochemical data contained glutamine (Q) in their hydrophobic tail. With the new algorithm we then tried to identify C. albicans proteins that were known or predicted, based on homology with S. cerevisiae proteins, to be GPI-anchored. Among these C. albicans proteins were Phr1, Phr2, Phr3, Plb proteins, Kre1, Hwp1, Hyr1 and Als proteins. To identify these C. albicans proteins with the algorithm, small adjustments were required. The uncharged or hydrophobic tail was shortened to 10 amino acids, the region between the hydrophobic tail and the ω site was extended to a maximum of 19 amino acids and L and F were allowed at position ω + 1. The length of the C-terminal signal peptide as defined by this algorithm can vary from 16 to 31 residues (Figure 1B).
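The modified consensus can be written as a regular expression. The Python sketch below is a restatement of the pattern for illustration, not the FUZZPRO search that was actually run; the names GPI_SIGNAL and has_gpi_signal are hypothetical, and X is taken to mean any upper-case letter. The two constants reproduce the signal peptide length range of 16 to 31 residues, counting ω + 1, ω + 2, the spacer and the tail.

```python
import re

# [NSGDAC]-[GASVIETKDLF]-[GASV]-X(4,19)-[FILMVAGPSTCYWN](10)>
GPI_SIGNAL = re.compile(
    r"[NSGDAC]"               # omega site
    r"[GASVIETKDLF]"          # omega + 1
    r"[GASV]"                 # omega + 2
    r"[A-Z]{4,19}"            # spacer, any residue
    r"[FILMVAGPSTCYWN]{10}"   # uncharged or hydrophobic tail
    r"\Z"                     # '>' in the pattern: C-terminal end
)


def has_gpi_signal(protein: str) -> bool:
    """True if the C-terminus of the sequence fits the consensus."""
    return GPI_SIGNAL.search(protein.upper()) is not None


# The cleaved signal peptide is omega+1, omega+2, the spacer and the tail.
MIN_SIGNAL_LEN = 2 + 4 + 10
MAX_SIGNAL_LEN = 2 + 19 + 10

# C-terminal 42 residues of S. cerevisiae Cwp2 (Table 2), and the same
# fragment with a glutamic acid introduced into the tail.
cwp2_cterm = "SSTVETVSPSSTETISQQTENGAAKAAVGMGAGALAAAAMLL"
cwp2_charged_tail = "SSTVETVSPSSTETISQQTENGAAKAAVGMGAGALAAEAMLL"
```

In this sketch the Cwp2 fragment matches, whereas the variant with a charged residue in the last 10 positions does not, which mirrors the reason given in the text for Gas5 being missed.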

Validation of the algorithm using the S. cerevisiae genome sequence

To test whether the algorithm thus created was still selective for S. cerevisiae GPI proteins, we analysed the S. cerevisiae genome, which was downloaded from the Saccharomyces Genome Database (SGD). Using the FUZZPRO program from the EMBOSS software package, our algorithm selected 187 proteins (Table 1). These proteins were further analysed for the presence of an N-terminal signal peptide for secretion using SignalP version 2.0, which includes SignalP-NN and SignalP-HMM. Proteins giving a positive signal peptide prediction with both methods were subsequently analysed for the absence of internal transmembrane (TM) domains using TMHMM version 2.0 and PSORT II. Proteins that do not contain internal TM domains according to these methods were listed as predicted GPI proteins. In this way we selected 66 putative GPI proteins in S. cerevisiae (Table 2), which is considerably more than the 51 proteins that were selected with the screening method used by Caro et al. (1997). For two ORFs in our list, Kre1 (a known GPI protein) and YMR158W-A (an unknown ORF), the TM domain prediction was ambiguous because TMHMM recognized a possible TM domain, but PSORT II did not.

Table 1. Genome-wide identification of putative GPI proteins. Four different fungal species were analysed using the C-terminal GPI algorithm [NSGDAC]–[GASVIETKDLF]–[GASV]–X(4,19)–[FILMVAGPSTCYWN](10)>. ORFs with a C-terminal GPI motif were screened for additional GPI protein characteristics, as described in the text
Fungal species | Number of ORFs analysed | Number of ORFs with GPI motif | Number of putative GPI proteins
S. cerevisiae | 6332 | 187 | 66
C. albicans | 6726 | 237 | 104
Sz. pombe | 4950 | 119 | 33
N. crassa | 10 085 | 288 | 97

Table 2. Putative GPI proteins of Saccharomyces cerevisiae. S. cerevisiae GPI proteins were identified with the C-terminal GPI algorithm in a non-redundant genome file obtained from SGD [a]

(A) ORF | Protein name | S/T content (%) | Size (aa) | C-terminal 42 residues [b] | NCBI BLAST result
YAR066W | | 29 | 203 | SSGDISLSLSKAKKGEVTFSPYSN GTFSLSNAILNGGSVSGL | Sim. Awa1, cell wall protein
YBR078W | Ecm33 | 28 | 429 | ANVSASASSSSSSSKKSKG AAPELVPATSFMGVVAAVGVAYY | Ecm33 family, cell wall biogenesis
YDR134C | | 30 | 136 | KPTEKPTQQGSSTQTVTSYTG AAVKALPAAGALLAGAAALLL | Sim. flocculation proteins
YDR534C | Fit1 | 39 | 528 | AGIFTNGKSSTTPQIVNYTG AADSIAAGTGLMGAALAAVIFL | Cell wall protein
YGL228W | She10 | 16 | 577 | EKAAAEEFQRQQELLRQQEEEDEED VSYTSTSTITTTITMTL |
YHR143W | Dse2 | 46 | 325 | TCYVFYDDDDYYSTVYLTNPSQSVD AATTITSTNTIYATVTI | Daughter-specific expression
YHR204W | Mnl1 | 13 | 796 | HSPVLTSNGTREEDEFKMDG IGINDHSQLMLECTPIINLFIV | α-Mannosidase, CAZy family 47 [c]
YHR214W | | 29 | 203 | SGDISLSLSKAKKGEVTFSPYSN AGTFSLSNAILNGGSVSGL | Sim. Awa1, cell wall protein
YIR039C | Yps6 | 22 | 537 | DASSFSSSGGSSESTTKKQN AGYKYRSSFSFSLLSFISYFLL | Aspartic protease
YJR153W | Pgu1 | 24 | 361 | TAKRVKILVKNATNWQWS GVSITGGSSYSGCSGIPSGSGASC | Endopolygalacturonase, CAZy family 28
YKL096W-A | Cwp2 | 27 | 92 | SSTVETVSPSSTETISQQTEN GAAKAAVGMGAGALAAAAMLL | Cell wall protein
YKR073C | | 16 | 106 | QSALGSRSLTCTCPLIS AVQLHLERSFFSLAFFAVILLYIPG |
YLR390W-A | Ssr1 | 42 | 238 | QETVSSALPTSTAVISTFSEG SGNVLEAGKSVFIAAVAAMLI | Cell wall protein
YMR103C | | 15 | 120 | ERKDSLFCLLPLFLHSLGREQLIS SADDPGFPCAGSAMGSLT |
YMR158W-A [d] | | 26 | 106 | ITSFNLRISCSLLFSSFCRIA SVFNKASSWLTMLPPMAPLLS |
YMR238W | Dfg5 | 15 | 458 | NSSTTNVLQNNLNIKKGDRA GAAIITAVILSVLTGGAVWMLF | α-1,6-Mannanase, CAZy family 76
YMR251W-A | Hor7 | 16 | 237 | SAANSSNSSSSKNAAQPIAGLNNG KVAGAAGVALAGALAFLI |
YNR067C | Dse4 | 27 | 1117 | FDPVDSYAFFSDSTFD SSTYLDNGMSRTWALAFSGGLANSIA | β-1,3-Glucanase, CAZy family 81
YOL011W | Plb3 | 18 | 686 | IPSATATLEKKAATNS GSHLSGISVKFSAMIMLTLLMFTGAV | Phospholipase B
YOL052C-A | Ddr2 | 15 | 61 | VNAQNASNTTSNAAPALHAQNGQLLN AGVVGAAVGGALAFLI |
YOL154W | Zps1 | 15 | 249 | EEVVDYTQNNATYAVRNTDNYLYYLA DVYSASVIPGGCLGNL | Similarity with CaPra1
YOR382W | Fit2 | 41 | 153 | STSSSSSSSSSSSSASSSG AAPAAFQGASVGALALGLISYLL | Cell wall protein

(B) ORF [e] | Protein name | ORF | Protein name | ORF | Protein name | ORF | Protein name
YAL063C | Flo9 | YGR189C | Crh1 | YKR102W | Flo10 | YMR307W | Gas1
YAR050W | Flo1 | YHR126C | | YLR040C | | YNL190W |
YBR067C | Tip1 | YHR211W | Flo5 | YLR110C | Ccw12 | YNL300W |
YCL048W | | YIL011W | Tir3 | YLR120C | Yps1, Yap3 | YNL322C [d] | Kre1
YCR089W | Fig2 | YIR019C | Muc1, Flo11 | YLR121C | Yps3 | YNL327W | Egt2
YDR055W | Pst1 | YJL078C | Pry3 | YLR194C | | YNR044W | Aga1
YDR077W | Sed1 | YJR004C | Sag1, Aga1 | YLR343W | Gas2 | YOL132W | Gas4
YDR261C | Exg2 | YJR150C | Dan1 | YMR006C | Plb2 | YOL155C |
YEL040W | Utr2, Crh2 | YJR151C | Dan4 | YMR008C | Plb1 | YOR009W | Tir4
YER011W | Tir1 | YKL046C | Dcw1 | YMR200W | Rot1 | YOR010C | Tir2
YER150W | Spi1 | YKL096W | Cwp1 | YMR215W | Gas3 | YOR383C | Fit3

[a] (A) ORFs that were not, and (B) ORFs that were, identified by the screening method used in Caro et al. (1997). Sim, similarity with; Ca, C. albicans.
[b] Putative GPI-signal peptide cleavage sites are indicated with a space. Note that other cleavage sites cannot always be excluded.
[c] CAZy, Carbohydrate-active enzyme classification according to Coutinho and Henrissat (1999).
[d] TM domain prediction not unambiguous.
[e] C-termini, size and S/T content of these ORFs can be accessed at http://www.pasteur.fr/recherche/unites/Galar_Fungail.

Three S. cerevisiae proteins, for which GPI attachment has been suggested based on structural similarity, were not recognized by our C-terminal algorithm. Two of these, Yps2/Mkc7 and Crr1, were also not picked up with the algorithm that was used by Caro et al. (1997). Both have multiple positively charged residues in the C-terminal hydrophobic tail. Surprisingly, the aspartic protease Yps2 has been reported to be cleaved from membranes using a GPI-specific phospholipase (Komano and Fuller, 1995). The C-terminal region of Yps2 may therefore contain sequencing errors. The third protein, Gas5 (YOL030w), which belongs to the Gas1 family of GPI proteins, was not recognized by our algorithm because it has a negatively charged glutamic acid residue in the hydrophobic tail. Our algorithm also did not recognize four unknown ORFs (YPL130w, YOR214c, YJL171w and YLR042c) that were identified by Caro et al. (1997). Finally, YPL261c and Sps2 (a protein with homology to GPI protein Ecm33), which were also identified in their screen for GPI proteins, were excluded from our GPI protein list because their N-terminal regions are unlikely to be signal peptides for secretion.

For 16 of the 22 additionally identified proteins, GPI modification has been shown or predicted, such as for the cell wall proteins Cwp2, Ssr1, Fit1, Fit2, and Yar066w and Yhr214w, which both show similarity to the sake yeast-specific cell wall protein Awa1 (Shimoi et al., 2002). Dfg5, which shows similarity to bacterial endomannanases (Kitagaki et al., 2002) and is required for filamentous growth, Dse4, which shows similarity to endo-1,3-β-glucanases, and the aspartic protease Yps6 and phospholipase Plb3 are associated with the plasma membrane. The new algorithm therefore seems to be more effective and selective in recognizing GPI proteins than the earlier one. One of the newly identified proteins, Mnl1, which has homology to α-mannosidases of carbohydrate-active enzyme family 47, has been reported previously to be localized in the endoplasmic reticulum (Nakatsukasa et al., 2001). This localization was performed with an HA tag fused to the C-terminus of the protein. In the case of GPI anchor modification, this would cause mislocalization of the protein, which might explain why Mnl1 was not found at the cell surface. Furthermore, the Mnl1 peptide sequence does not contain an ER retention motif, it has no predicted TM domains (according to both the TMHMM and the PSORT II program) and PSORT II predicts this protein to be localized at the cell surface with a high probability of 67%. GPI proteins usually have a high percentage of S and T residues, the side-chains of which are potential sites for O-glycosylation (Klis et al., 2002). The S/T content in our set of GPI proteins varies from 13% to 55% with an average of 25% for the 22 new proteins. This also indicates that at least the majority of these proteins are GPI proteins.
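The S/T content quoted here is the percentage of serine and threonine residues over the full length of the protein. A minimal Python sketch of that calculation follows; the function name st_content is hypothetical. Because the tables list only the C-terminal 42 residues, the tabulated percentages cannot be reproduced from those fragments alone.

```python
def st_content(protein: str) -> float:
    """Percentage of serine and threonine residues in a sequence."""
    if not protein:
        raise ValueError("empty sequence")
    seq = protein.upper()
    return 100.0 * (seq.count("S") + seq.count("T")) / len(seq)
```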

For many of the GPI proteins we identified, our algorithm recognizes multiple sites at which cleavage of the GPI anchor signal peptide could occur. According to Udenfriend and Kodukula (1995), the length of this signal peptide can vary (15–30 amino acids, with an average of 23 amino acid residues). Analysing the known GPI proteins from S. cerevisiae, we also found an average length of 23 amino acids for the GPI signal peptide. Therefore, in Tables 2–3 we have indicated signal peptides that are preferentially of about this length.
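One way to list every cleavage site that fits the consensus and then pick the one giving a signal peptide closest to 23 residues is sketched below in Python. This is an illustration of the selection idea, not the procedure used to prepare the tables; the function names are hypothetical, and no rule for breaking ties between equally distant candidates is stated in the text, so the first candidate wins here.

```python
import re

# Consensus without the end anchor; fullmatch() supplies the C-terminal anchor.
GPI_SIGNAL = re.compile(
    r"[NSGDAC][GASVIETKDLF][GASV][A-Z]{4,19}[FILMVAGPSTCYWN]{10}"
)
PREFERRED_LEN = 23  # average signal peptide length quoted in the text


def candidate_omega_sites(protein: str) -> list[tuple[int, int]]:
    """Return (omega_index, signal_peptide_length) for every site that fits."""
    seq = protein.upper()
    hits = []
    for i in range(len(seq)):
        # Matching from position i to the end forces the motif to reach
        # the C-terminus of the protein.
        if GPI_SIGNAL.fullmatch(seq, i):
            hits.append((i, len(seq) - i - 1))
    return hits


def preferred_omega_site(protein: str):
    """Index of the candidate whose signal peptide is closest to 23 residues."""
    hits = candidate_omega_sites(protein)
    if not hits:
        return None
    return min(hits, key=lambda hit: abs(hit[1] - PREFERRED_LEN))[0]


# C-terminal 42 residues of S. cerevisiae Cwp2 as listed in Table 2.
cwp2_cterm = "SSTVETVSPSSTETISQQTENGAAKAAVGMGAGALAAAAMLL"
sites = candidate_omega_sites(cwp2_cterm)
best = preferred_omega_site(cwp2_cterm)
```

For the Cwp2 fragment this yields four candidate sites, and the preferred one is the N at index 20, the same position that is marked with a space in Table 2.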

Table 3A. Putative GPI proteins of Candida albicans that give significant BLAST results. C. albicans GPI proteins were identified with a C-terminal GPI algorithm in a non-redundant genome file based on Stanford assembly 19
Stanford ORF19 No. allele 1/allele 2 | Protein name | S/T content (%) | Size (aa) | C-terminal 42 residues [b] | NCBI BLAST result
5741/13163 | Als1.5′ + 3′ [c] | 37 | 658 + 828 | QQVTSSSPSTNTFIASTYD GSGSIIQHSTWLYGLITLLSLFI | Agglutinin-like sequence
2355/9891 | Als10 | 35 | 1586 | LSQQMTSSLVSLHMLTTFDG SGSVIQHSTWLCGLITLLSLFI | Agglutinin-like sequence
5745/13168 | Als11.5′ + 3′ | 37 | 952 + 949 | LSQSLISSSTKTVIASTYDG SGSVIKLHSWFYGLVTIFFLFI | Agglutinin-like sequence
2122/9670 | Als12.5′ + 3′ | 36 | 1068 + 220 | LSQQTTSSLISTPLASTFDG SGSIVQHSGWLYVLLTAISIFF | Agglutinin-like sequence
1816/9379 | Als3.5′ + 3′ | 33 | 886 + 111 | LSQQMTSSLVSLHMLTTFDG SGSVIQHSTWLCGLITLLSLFI | Agglutinin-like sequence
4556 | Als4.5′ + 3′ | 35 | 1593 + 546 | LSQQTTSSLISTPLASTFDG SGSIVQHSGWLYVLLTAISIFF | Agglutinin-like sequence
7414 | Als5 | 38 | 1347 | IQQVATSSYNQPLITTYAG SSSATKHPSWLLKFISVALFFFL | Agglutinin-like sequence
7414 | Als6 | 39 | 1366 | IQQVATSSYNQPLITTYAG SSSATKHPSWLLKFISVALFFFL | Agglutinin-like sequence
7400 | Als7 | 35 | 2000 | TVTEQYDTSTYTPASLLVSDN SGSVSKYSLWMMAFYMLFGLF | Agglutinin-like sequence
5742/13164 | Als9.5′ | 33 | 894 | PDGTNSVIVKEPHNPTVTTTEFWSES FASTTTITNPSTVLIV | Agglutinin-like sequence
7517 | Cht1 | 23 | 462 | TTVGSQMLQSLFDKRDVIAEA KSTNLQICWLLFIPLLALICS | Chitinase, CAZy family 18 [d]
3895/11376 | Cht2 | 25 | 583 | SVASNGTNTTVPVFTFEGG AAVANSLNSVWFPVPFLLAAFAF | Chitinase, CAZy family 18
2706/10221 | Crh11 | 29 | 453 | AAPSSSASEKPSVSTTENN GAVSVAKTTSLFGFVALIGFLFV | Crh1 family, CAZy family 16
7114 | Csa1 | 29 | 1018 | GVSQATVAANTHSVA IANMANTKFASTMSLLVASFVFVGLFI | Mycelial surface antigen
2075/9622 | Dfg5 | 14 | 451 | TNSEDNANKNELTITGKDKA GAGVLTAIVLAVILGGAIWMIF | Sim. ScDfg5/Dcw1, CAZy family 76
CaDB 5366.2 [a] | Ecm33.3 | 25 | 412 | GSDSSSSASGSSSSSKKGG AASNNGKLASVVAAFAAVGVALF | Ecm33 family, cell wall biogenesis
4255/11730 | Ecm331 | 24 | 413 | SKSSDGSSSSNSSSSSKKG ASNVLVVPGMVLTTALGVLLALI | Ecm33 family, cell wall biogenesis
2952/10469 | Exg2 | 14 | 479 | IMPQPLDNYKYVKNGTDTS SASAIASNKMTLLLAFLLVILVI | Glucan 1,3-β-glucosidase, CAZy family 5
1321/8901 | Hwp1 | 29 | 634 | VPLMQPSANYSSVAPISTFEG AGNNMRLTFGAAIIGIAAFLI | Hyphal cell wall protein
4975/12440 | Hyr1 | 32 | 937 | SNTTDSSSSVPTIDTNENG SSIVTGGKSILFGLIVSMVVLFM | Hyphally regulated protein
575/8206 | Iff2 | 36 | 1249 | ADSGSTSSASAMVIPNTNG SGKLLNGKVLTLSVLSSMVVVFL | Similarity with Hyr1
4361/11839 | Iff3 | 33 | 941 | VITPTNSSSSAVTISYEN GSNKESIENIKYLALVVFGLMMFM | Similarity with Hyr1
7472 | Iff4 | 35 | 1526 | SQTIETGSSSFTAIPFEN GSTNISNKYLKFLGTVVSILILLI | Similarity with Hyr1
2879/10397 | Iff5 | 36 | 1311 | SETGSPATSIVGTLPYEN GSNQLSIENIKYLILVVFGLMMIM | Similarity with Hyr1
4072/11553 | Iff6 | 38 | 1087 | TSTQGPSSSNSATIPEQAN SGNHIKFTLFNGLLIGLVPIVFM | Similarity with Hyr1
3279/10789 | Iff7 | 27 | 1225 | TYSAGVGGSNVSGLISKSES VVLLIRPVMIFVFLAICVVIML | Similarity with Hyr1
570/8201 | Iff8 | 35 | 714 | SSSSSSSGTPGEVIPNANG SSKLSIGMTFMISGFATMFALFM | Similarity with Hyr1
4377/11855 | Kre1 | 28 | 130 | TISQGNGGLSKFNQNGLEMKN LSFVKLIGVSFIAFISFILLI | ScKre1 homolog
3212/10724 | Mid1 | 18 | 559 | HVSDQWPSCNYLGPLKSSKS GAIKLIINWTIVVVSFTLMVVI | Ca2+-permeable channel protein
7625 [e] | Pga1 | 23 | 132 | KTYSQTTITNAN IAPSNNNVFLGIESLYTGIVGGIVLILGLL | ScKre1 homolog
2062/9609 | Pga2 | 25 | 232 | SKNSTNGSSGSSTSASQG SGAGRAEISGFLAAGIAGVVAALI | Cu/Zn superoxide dismutase
2060/9607 | Pga3 | 20 | 228 | SQSAVNTSSSMASTAPQGN GAERAVVNGLLAAGVVGVIAALI | Cu/Zn superoxide dismutase
4035/11518 | Pga4 | 18 | 451 | TSQVSSSSATSANSTSSKKN DAAVEGAGFLSVIALAAGIALL | Gas/Phr family, CAZy family 17
3693/11177 | Pga5 | 14 | 641 | SGEKSPKTSKSIAGGNAITFKN DSIWKTFIEILFTCSAAILI | Gas/Phr family, CAZy family 17
4765/12229 | Pga6 | 35 | 219 | SSSSSVVPSANVTTFEGGAVG GASNQITVGFAAIAGLAAILL | Similarity with ScFlo1
5635/13080 | Pga7 | 37 | 219 | GSSSTDSSSSSSSSPSSSAN FAVLQTGGIGSVILGFMMYLLV | Similarity with Csa1/Rbt5
3380/10888 | Pga8 | 48 | 908 | ISVAVSSAAQSSIAAISSYEG TGNNMKLSFGVVIAGVAAFAI | Similarity with Rbt1
2108/9656 | Pga9 | 13 | 316 | SPKNYSNISIHGISSDCLN DGMMVTGSVFGSLVLGIAAGIFV | Cu/Zn superoxide dismutase
5674/13119 | Pga10 | 26 | 250 | ETSVAQSSSSANVASVSAETAN AGNMPVIAIGGVIAAFAALI | Similarity with Csa1/Rbt5
3829/11310 | Phr1 | 21 | 548 | SSTSSGSSSSS GVKATQQMSMVKLVSIITIVTAFVGGMSVVF | Gas/Phr family, CAZy family 17
6081/13500 | Phr2 | 22 | 544 | SSSSSSGSSGSKSAA SIVSVNLLTKIATIGISIVVGFGLITM | Gas/Phr family, CAZy family 17
6594 | Plb3 | 17 | 632 | NFTTEGTTNDTESMKVN SGVLTSVSWSLWAMMLAALSVFILG | Phospholipase B
1443/9018 | Plb4.5′ + 3′ | 18 | 394 + 304 | DTQNNDEKEEFIGVVRESNS DSLKLSKYLTIASLFALYLVIM | Phospholipase B
5102/12568 | Plb5 | 22 | 754 | SSSTGSSSSSKKKNGGDLVN GGVPSSIFLVFNSLLGLIIAYL | Phospholipase B
1327/8907 | Rbt1 | 31 | 714 | VPLMQPSANYSSVAPISTFEG AGNNMRLTFGAAIIGIAAFLI | Repressed by Tup1 protein 1
5636/13081 | Rbt5 | 22 | 241 | SSVAQSSSSAADVASVSVEAAN AGNMPAVAIGGVIAAVAALF | Repressed by Tup1 protein 5, sim. Csa1
6928/14190 | Sap9 | 23 | 544 | SGTTSSSGTSTSTSTRHS AGSIISNPVYGLLLSLLISYYVLV | Aspartyl protease
2237/9779 | Spr1 | 13 | 510 | FDYNSNSTTTMTTSSPKKNGCSIIN VGESWLWIIFVYYLSIF | Glucan 1,3-β-glucosidase, CAZy family 5
7030 | Ssr1 | 38 | 234 | STTAAASSSESTTATGVLTQSEG SAAKVGLGALVGLVGAVLL | N-term. has similarity with ScSsr1
1671/9240 | Utr2 | 23 | 470 | DSKSTDSGSSGSSSQG VANSLNESVISGIFASICLGILSFFM | Crh1 family, CAZy family 16
1127/2221 | Yck3.5′ + 3′ | 14 | 378 + 142 | AYLQQNGMIPIRDN DVELHSDMSSYTRKRSGTWLSWIFCCCS | Casein kinase I

[a] The CandidaDB IPF number is given if the ORF is not present in Stanford's Assembly 19 database. Translational start codon not identified.
[b] Putative GPI signal peptide cleavage sites are indicated with a space. Note that other cleavage sites cannot always be excluded.
[c] For ORFs that are broken at contig ends, corresponding N- and C-terminal parts in the genome file are combined.
[d] CAZy, Carbohydrate-active enzyme classification according to Coutinho and Henrissat (1999).
[e] TM domain prediction is not unambiguous. Sim, similarity with; Sc, S. cerevisiae; Pga, predicted GPI-anchored protein.

Table 3B. Putative GPI proteins of Candida albicans that do not give significant BLAST results. C. albicans GPI proteins were identified with a C-terminal GPI algorithm in a non-redundant genome file based on Stanford assembly 19
Stanford ORF19 No. [a] allele 1/allele 2 | CandidaDB name | S/T content (%) | Size (aa) | Stanford ORF19 No. [a] allele 1/allele 2 | CandidaDB name | S/T content (%) | Size (aa)
7609 | Pga11 | 15 | 124 | 2759/10272 | Pga38 | 40 | 517
7597 | Pga12 | 16 | 315 | 6302/13679 | Pga39 | 40 | 281
6420/13778 | Pga13 | 36 | 456 | 1616/9183 | Pga40 | 36 | 1114
968/8583 | Pga14 | 24 | 131 | 2906/10424 | Pga41 | 35 | 296
2878/10396 | Pga15 | 22 | 165 | 2907/10425 | Pga42 | 29 | 227
848/8468 | Pga16 | 18 | 105 | 2910/10428 | Pga43 | 14 | 236
893/8512 | Pga17 | 20 | 557 | 1714/9282 | Pga44 | 28 | 307
300/7933 | Pga18 | 46 | 753 | 2451/9987 | Pga45 | 21 | 462
2033/9581 | Pga19 | 16 | 229 | 3638/11120 | Pga46 | 18 | 332
535/8168 | Pga20 | 31 | 111 | 1401/8979 | Pga47 | 35 | 653
532/8165 | Pga21 | 27 | 168 | 6321 | Pga48 | 25 | 108
3738/11223 | Pga22 | 34 | 169 | 4404/11882 | Pga49 | 24 | 734
3740/11225 | Pga23 | 29 | 271 | 1824/9383 | Pga50 | 24 | 309
3618/11101 | Pga24 | 26 | 533 | 1989/9540 | Pga51 | 15 | 452
6336/7542 | Pga25.5′ + 3′ [c] | 31 | 361 + 897 | 1911/9467 | Pga52 | 23 | 384
2475/10012 | Pga26 | 28 | 131 | 4651/12120 | Pga53 | 33 | 139
2044/9592 | Pga27 | 22 | 485 | 2685/10200 | Pga54 | 39 | 342
5144/12609 | Pga28 | 33 | 226 | 207/7838 | Pga55 | 44 | 1176
5305/12765 | Pga29 | 18 | 204 | CaDB15957.2 [b] | Pga56 | 24 | 84
5303/12763 | Pga30 | 25 | 277 | 4689/12158 | Pga57 | 30 | 214
5302/12761 | Pga31 | 25 | 293 | 4334/11809 | Pga58 | 20 | 240
6784/14076 | Pga32 | 39 | 430 | 2767/10283 | Pga59 | 29 | 113
876/8495 | Pga33 | 30 | 262 | 5588/13035 | Pga60 | 29 | 741
2833/10351 | Pga34 | 33 | 197 | 5762/13185 | Pga61 | 27 | 223
4910/12376 | Pga35 | 38 | 250 | 2765/10281 | Pga62 | 33 | 213
5760/13183 | Pga36 | 36 | 392 | 6216/13597 | Pga63 | 31 | 148
3923/11405 | Pga37 | 31 | 208 | | | |

[a] C-termini of these ORFs can be accessed at http://genolist.pasteur.fr/CandidaDB.
[b] The CandidaDB IPF number is given if the ORF is not present in Stanford's Assembly 19 database. Pga, predicted GPI-anchored protein.
[c] For ORFs that are broken at contig ends, corresponding N- and C-terminal parts in the genome file are combined.

C. albicans GPI proteins

C. albicans GPI proteins were selected from a non-redundant genome file containing 6726 ORFs that was created by independent annotation of Stanford's contig assembly 19 by members of the EU Framework V program Galar Fungail. From this file, our algorithm selected 237 ORFs, and further analysis of these ORFs resulted in 104 putative GPI proteins (Tables 3A and 3B). Extending the undefined spacer region of 19 amino acids in our algorithm with one residue resulted in eight additional proteins, none of which has a predicted signal sequence according to SignalP version 2.0. For Pga1, which has homology to the GPI proteins ScKre1 and CaKre1, TMHMM recognized a possible TM domain but PSORT II did not (Table 3A). Some ORFs in genome assembly 19 are broken at the contig ends. For those ORFs, corresponding N- and C-terminal parts in the genome file were combined as indicated in Table 3A. For Ecm33.3, a reliable signal peptide prediction is lacking because a truncated version (no translational start codon found) is present in CandidaDB. This ORF has been confirmed by cDNA sequencing (Fradin and Hube; see CandidaDB), and has a C-terminal GPI anchor sequence, homology with GPI proteins and an S/T content of 25%. The overall S/T content in the C. albicans GPI proteins varies from 13% to 48% with an average of 28%, which is comparable to what we found in S. cerevisiae. For 51 of the selected ORFs, BLAST results indicated that they are related to known or putative GPI proteins. Proteins that have not yet received a functional name are now, in accordance with CandidaDB, tentatively named Pga for predicted GPI-anchored proteins.

Among the identified ORFs are members of the Als protein family (10 ORFs), Hyr1, seven Iff/Hyr1-related ORFs, Csa1, Rbt5, two Csa1/Rbt5-related ORFs, Hwp1, Rbt1 and a Hwp1/Rbt1-related ORF, all (related to) typical C. albicans proteins. For Als proteins and Hwp1, cell wall association has been shown immunologically by releasing proteins with β-1,6-glucanase and β-1,3-glucanase, respectively (Kapteyn et al., 2001; Sundstrom, 2002). In C. albicans nine ALS genes have been discovered thus far (Hoyer, 2001). Proper annotation of this protein family in CandidaDB has been hampered because these proteins have regions that are almost identical, they have internal repeats and they are large (ca. 900–2000+ residues in CandidaDB), which may have led to erroneous alignment of contigs. Additionally, we found orthologues of known S. cerevisiae GPI proteins, e.g. Ssr1, Dfg5, Cht2, Sap9 and Kre1, and we also found orthologues of known S. cerevisiae GPI protein families, e.g. four ORFs of the Gas family, three phospholipases (Plb3, 4 and 5), two members of the Ecm33 family, two Crh-family proteins and two chitinases (Cht1 and Cht2). By mass-spectrometric analysis of cell wall tryptic digests, Cht2 has recently been shown to be cell surface-associated (Iranzo et al., 2002).

Identification of putative Sz. pombe GPI proteins

Besides S. cerevisiae and C. albicans, GPI proteins have recently also been identified in a variety of other fungi. Examples are the agglutinin-like sequence (Als) proteins in Candida tropicalis and Candida dubliniensis (Hoyer et al., 2001), the Epa1 adhesin of Candida glabrata (Frieman et al., 2002), Gas/Phr homologues in C. glabrata (Weig et al., 2001), Candida maltosa (Nakazawa et al., 2000) and Aspergillus fumigatus (Mouyna et al., 2000), Fem1 of Fusarium oxysporum (Schoffelmeer et al., 2001), MP1 of Penicillium marneffei (Cao et al., 1998), CAP22 of Glomerella cingulata (Hwang and Kolattukudy, 1995), Ag2 of Coccidioides immitis (Zhu et al., 1996), Psu1 (Omi et al., 1999), Yps1 (Ladds and Davey, 2000) and the Dfg5 homologue SPCC970.02 (Kitagaki et al., 2002) of Schizosaccharomyces pombe, and the Dfg5 homologue NCU03770.1 of Neurospora crassa (Kitagaki et al., 2002). All these proteins have C-terminal regions that conform to the GPI algorithm, which suggests that the algorithm is applicable for identification of GPI proteins in other fungi. We therefore also applied our algorithm to the ORFs identified in the complete genome sequences of the fission yeast Sz. pombe and the filamentous ascomycete N. crassa. In the Sz. pombe file that was downloaded from the website of the Sanger Institute, we found 33 GPI protein candidates among 4950 ORFs (Table 1). In two of the selected ORFs, PSORT II recognized a possible TM domain whereas TMHMM did not. Both these proteins, SPCC1795.09 (aspartic protease) and SPCC757.12 (α-amylase), have homology with known or predicted GPI proteins from other fungi. For 19 of the identified proteins, BLAST analysis showed homology with known or putative GPI proteins (Table 4), which indicates that the number of false positives that we may have picked up will be low. As in C. albicans and S. cerevisiae, we found, for instance, α-amylases, aspartic proteases and members of the typical Gas/Phr, Ecm33 and Plb GPI protein families.
The S/T content in Sz. pombe GPI proteins varies between 11% and 60% with an average of 29%. Interestingly, the 14 unannotated ORFs have an average S/T content of 41%, strengthening the notion that these are indeed GPI proteins. Remarkably, in SPBPJ4664.02, a large ORF of 3971 residues, ca. 260 copies of the amino acid sequence N[ST]STPITSST[AV][LV] are present. Considering that this protein is expected to be secreted and that it has one potential N-glycosylation site and seven potential O-glycosylation sites in each repeat, it may be extremely heavily glycosylated. Multiple S/T-rich repeated sequences are also present in Als proteins (Hoyer, 2001) and hyphally-related sequences (HYR1; Bailey et al., 1996) of C. albicans and in C. glabrata Epa1 (Frieman et al., 2002), which are all yeast GPI cell wall proteins. Unfortunately, BLAST analysis of the regions preceding and following the 12-amino-acid repeats did not give further clues about the function of SPBPJ4664.02.
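The S/T percentages and the per-repeat glycosylation counts quoted above can be checked with a few lines of Python. In this minimal sketch, the helper st_content and the example repeat unit NSSTPITSSTAL (one of the variants allowed by the motif) are illustrative assumptions; the N-X-S/T sequon with X other than proline is the standard rule for potential N-glycosylation sites, and the full SPBPJ4664.02 sequence is not included here.

```python
import re

def st_content(seq: str) -> float:
    """Percentage of serine and threonine residues (hypothetical helper)."""
    seq = seq.replace(" ", "").upper()
    return 100.0 * sum(aa in "ST" for aa in seq) / len(seq)

# Repeat motif from the text and the standard N-glycosylation sequon
REPEAT = re.compile(r"N[ST]STPITSST[AV][LV]")
SEQUON = re.compile(r"N[^P][ST]")

unit = "NSSTPITSSTAL"  # one variant permitted by the motif (assumption)

repeat_length = len(unit)                               # 12 residues
o_sites = sum(aa in "ST" for aa in unit)                # 7 potential O-glycosylation sites
n_sites = len(SEQUON.findall(unit))                     # 1 potential N-glycosylation site
copies_found = len(REPEAT.findall(unit * 3))            # 3 copies in a synthetic triple repeat
repeat_share = round(100 * 260 * repeat_length / 3971)  # ca. 79% of the ORF, taking 260 copies

print(repeat_length, o_sites, n_sites, copies_found, repeat_share)
print(round(st_content(unit), 1))
```

Applied to the real ORF, `REPEAT.findall` would give the copy number directly, and `st_content` would give the value reported in Table 4.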

Table 4. Putative Schizosaccharomyces pombe GPI proteins. Sz. pombe GPI proteins were identified with the C-terminal GPI algorithm in a non-redundant genome file created by the Sanger Institute

(A)
ORF | Protein name | S/T content (%) | Size (aa) | C-terminal 42 residues (a) | NCBI BLAST result
SPAC1002.13C | Psu1 | 25 | 417 | KIVGADGASVSGTCVFENGAFQN GGSGCTVGITSGSGVFVFY | Similarity with ScSUN family
SPAC1039.11C | | 15 | 995 | GVQVESFSYLEDTKELVLTNLEAFTS TGAFSNNWTISWNLPV | α-Glucosidase, CAZy family 31 (b)
SPAC11E3.13C | | 25 | 510 | SSAHSSGSSSGSSSATSS ASTFNLSRFYVFAGILAISGLVFA | Gas/Phr family, CAZy family 17
SPAC1705.03C | | 26 | 421 | SYSSDSSASSSSSSSHESS AASNGFTAGALVLGSLLVAALAM | Ecm33 family, cell wall biogenesis
SPAC1786.02 | | 18 | 644 | PVDDDSKNPTYNPAVKTSS ASGVHANILLSFFVLLATLLVTA | Lysophospholipase
SPAC17A5.04C | | 18 | 512 | LCYNGVCVPIEGSS ASWSKQPSLFCASGTMLISLAVIAWFFW | Zinc metallopeptidase
SPAC19B12.02C | | 25 | 542 | SNSSGSSSNSSSKSSSG ASSYNLNMVITFLSVVIGGTAVLFI | Gas/Phr family, CAZy family 17
SPAC23D3.14C | | 11 | 581 | GQQFWNTLTAKS EAKTIRSFTKLKLFILLIAVPFALPMIILI | α-Amylase, CAZy family 13
SPAC26A3.01 | Sxa1 | 25 | 533 | SSTSVDGSSSSDSSEAS GAASVGVSISAIVLCASTLISLLFA | Aspartic protease
SPAC27F1.07 | | 14 | 450 | GVPSDIVHISYEYNS SAFFLRIAIITTLLILLAAAAYMIWAP | Oligosaccharyltransferase α-subunit
SPAC2E1P3.05C | | 40 | 197 | TISSSVSTSSFTSLSSSGFS TVLSSTNTTSALPSSGWNVTGY | Cellulose binding β-glucosidase
SPAC821.09 | Eng1 | 23 | 1016 | QSYICYGNILCPIIN GSPLLACGNACYDSSIYGCSNGALVAA | Endo-1,3-β-glucanase, CAZy family 81
SPBC16A3.13 | Meu7 | 9 | 774 | DYNSEGSLSINVEEAQTS KAPEQNRGFTNVLAALLLSLLMIL | α-Amylase, CAZy family 13
SPBC342.03 | | 15 | 456 | NVTSTTSYTSGMTSSSES GSSKIGVAFCQALFITVLIATLSF | Gas/Phr family, CAZy family 17
SPBP4G3.02 | Pho1 | 15 | 453 | SDGMCELYAYLNSPVRVNG TSNGIQNFDTLCNASAVAAVYPY | Acid phosphatase
SPCC1795.09 (d) | Yps1 | 21 | 521 | IPSFNISLISQNAVAN AGNSFSPLSAMVIMMMSAVFLGLGII | Aspartic protease
SPCC63.02C | | 18 | 564 | SARSFTGTGSIFTISS SSRLILSFKTLVFGLGVTAMLFVLFF | α-Amylase, CAZy family 13
SPCC757.12 (d) | | 25 | 625 | VTSTAYSSSSSSSSSSSIES SANAVRVSILGVAAFIAIVLFI | α-Amylase, CAZy family 13
SPCC970.02 | | 15 | 442 | TGDPNAGLYTAPVS FANKNFENLRKHWMLLGFFLLVPTLVLY | Sim. ScDfg5/Dcw1, CAZy family 76

(B)
ORF (c) | S/T content (%) | Size (aa)
SPAC19G12.16C | 50 | 670
SPAC1F8.02C | 22 | 226
SPAC27E2.11C | 35 | 81
SPAPB15E9.01C | 60 | 743
SPAPJ760.03C | 32 | 166
SPBC1E8.05 | 48 | 317
SPBC215.13 | 59 | 534
SPBP19A11.02C | 42 | 244
SPBPB7E8.01 | 20 | 569
SPBPJ4664.02 | 55 | 3971
SPCC1322.10 | 42 | 262
SPCC1742.01 | 44 | 1563
SPCC24B10.06 | 24 | 156
SPCC553.10 | 44 | 349

(a) Putative GPI-signal peptide cleavage sites are indicated with a space. Note that other cleavage sites cannot always be excluded.
(b) CAZy, Carbohydrate-active enzyme classification according to Coutinho and Henrissat (1999).
(c) C-termini of these ORFs can be accessed at http://genolist.pasteur.fr/CandidaDB.
(d) TM domain prediction is not unambiguous.
Sim, similarity with; Sc, S. cerevisiae.
(A) ORFs that do and (B) ORFs that do not give significant BLAST results.

Identification of putative N. crassa GPI proteins

To search for GPI proteins in the filamentous fungus Neurospora crassa, we used assembly 3 of the Whitehead Institute. In N. crassa we found 97 GPI protein candidates in a file containing 10 085 ORFs (Table 1). Of the selected ORFs, only NCU09729, a protein with homology to chitinase GPI proteins, gave an ambiguous TM domain prediction (PSORT II predicting one TM domain, TMHMM predicting absence of TM domains). For 43 of the predicted GPI proteins, BLAST analysis revealed homology with known or putative GPI proteins (Table 5). This indicates that our algorithm also efficiently identifies GPI proteins in N. crassa. The S/T content in the putative N. crassa GPI proteins varies from 12% to 37% with an average of 21%, which is slightly lower than in S. cerevisiae, C. albicans and Sz. pombe. Similar to the results obtained for the three yeasts, we also found in N. crassa orthologues of the Gas family and the Crh family, but members of the Ecm33 family were not identified. Furthermore, we found a diverse range of proteolytic and carbohydrate-modifying enzymes and orthologues of known cell surface proteins, such as Sz. pombe Psu1 and C. albicans Csa1. Surprisingly, the GPI algorithm also selected the rodlet protein Ccg-2, which belongs to a class of hydrophobic cell surface proteins that share a common pattern of eight cysteine residues and are found in various filamentous fungi (Wessels, 1997) but not in yeasts.

Table 5. Putative Neurospora crassa GPI proteins. N. crassa GPI proteins were identified with the C-terminal GPI algorithm in a non-redundant genome file (release 2) that was obtained from the Whitehead Institute

(A)
ORF | Protein name | Size (aa) | S/T content (%) | C-terminal 42 residues (a) | NCBI BLAST result
NCU00086.1 | | 369 | 14 | ANSAYQKARNADDTYG TSWSGPFDGSSLAKQQSAASLWVALL | Sim. ScDfg5/Dcw1, CAZy family 76 (b)
NCU01050.1 | | 238 | 19 | AKTPSTVSFPGAYS GSDPGVKISIYWPPVTSYTVPGPSVFTC | Endoglucanase, CAZy family 61
NCU01162.1 | | 457 | 17 | PSGSVESQKKSGSTRGIG SVDKAPFVVTGLVMFFTLTGTLLL | Gas/Phr family, CAZy family 17
NCU01353.1 | | 725 | 24 | QPPLATATEDAPVMTAG AGKKNTLMSVGVAGVVAGVAALLAL | Sim. Mlg, mixed-linked glucanase
NCU01418.1 | Ccg-6 | 142 | 27 | SAPAVVTPTVSPSEVPTAG AGKAAALSGAGLVGVLGLAAILL | Clock-controlled gene-6 protein
NCU02184.1 | | 1434 | 20 | PPAQTEPVTGSEPSEVPVTAG AGRNVVAMGVPALVAALVLAL | Chitinase, CAZy family 18
NCU02909.1 | | 574 | 16 | ASTPTSASSGGAAEKNS TGAREMGSLWERVLLVVVLGFVMVL | Chitosanase
NCU02956.1 | | 492 | 17 | GNDGKDENASAGMPSAFG VAQMSVMGIAMVFAMVGSGVFVLL | Aspartic protease
NCU03013.1 | | 248 | 16 | STTGGAPSPSGTLPVAAG ASLSASSMTGVAAALLGGAMMLML | Cu/Zn superoxide dismutase
NCU03168.1 | | 529 | 19 | SGAKETDSAAAGLSARGG AVGALAVASLTGFLALVGGAVVAL | Aspartic protease
NCU03530.1 | | 260 | 31 | SALSSALQSQASSVSSPG AAPHQTAAPIAGVLAAAGFAAMLL | Chitinase
NCU03770.1 | | 489 | 14 | DPASRGGLAALKPITMADRVG AGIVTAILAISIVGGSVFLTI | Sim. ScDfg5/Dcw1, CAZy family 76
NCU04395.1 | Neg1 | 480 | 18 | VVIENKFGNEIYVTVEAKSG EVWSGLVYRNSVVTWVLPAAGA | β-1,6-Glucanase, CAZy family 30
NCU04542.1 | | 224 | 20 | TNTAECKAIVARLQGKSGAGLN SAPVLAAVAAFAGLVGVLAL | Chymosin
NCU04543.1 | | 310 | 23 | GASGSGSAGASSTTVSGN VGAPKQTFAAAAALAGLVAAVAAL | Chymotrypsin
NCU05071.1 | | 351 | 23 | IQGTSDQSGCYGYSCVRQLS GSQNLKHADTYALFANSIYVGC | Neutral protease
NCU05104.1 | | 450 | 14 | GAARGDCPQSSGVPAEVESQYANS KVVYSNIRFGPVGSTVNV | Cellobiohydrolase, CAZy family 7
NCU05686.1 | | 477 | 21 | CDDGSGSGSGGSSTKNN GVKTRDVQGASALAAVVGIVGLMVL | Crh1 family, CAZy family 16
NCU05974.1 | | 364 | 22 | SSSTSDATEATKSGNVVTGG AGKLAMSIASLGAALFAGAMLL | Crh1 family, CAZy family 16
NCU06055.1 | | 433 | 16 | LEGLTDVTAVGNRIKELAQKTGA KVTNNVRGTTSLIANNGNL | Alkaline protease
NCU06185.1 | Mlg1 | 245 | 28 | GNGTVPTTGSPSPSAPVTAG AGAMTGSALLAAAAGLLTVFLA | Mixed-linked glucanase
NCU06381.1 | | 406 | 22 | MTVASLPSGTSTAASSGKA SSIKGSMGDAIAVVMLAMALFGP | Sim. ScBgl2/CaBgl2, CAZy family 17
NCU06703.1 | | 699 | 14 | TNGEVVCNYPGSAHIFSG SSKDAVSVGLTIVVLVLVSLLLAC | Sim. SpYam8, Ca2+ channel protein
NCU06781.1 | | 494 | 16 | SGSSSEATSDAKNTSGAAG SAVAMTVPAMAMGAFAIFAGLAL | Gas/Phr family, CAZy family 17
NCU07063.1 | | 551 | 18 | GGGSGDGNGNNDDDDSAA SGLDVGVTAAAVLAGLNMLIVWLL | Aspartic protease
NCU07159.1 | | 396 | 19 | LGKKSPAALCSYIASTAN SGVISGIPRGTVNKLAFNGNPSAY | Subtilisin-like serine protease
NCU07253.1 | Gel1 | 466 | 16 | SSASASATHNAAGFSVAGPVN KGPFVVTGLALLFTLVGAVAL | Gas/Phr family, CAZy family 17
NCU07355.1 | | 464 | 14 | VSVTRTLEDGTTLLIAQA KGGPAITTECVGGMVNWNSAVISS | Mutanase
NCU07454.1 | | 186 | 15 | PIASEMPGASSTSAKPPAATG GAGRSAQIGLGAGIAVAFALL | Similarity with CaCsa1
NCU07533.1 | | 603 | 19 | QLVVFDLGGMRVGFAGKELAEG VGKGETSARVPFGYYYPGIA | Aspartic protease
NCU07962.1 | | 183 | 25 | SSIESAASSAATSTHTSNPGPRQTAA AGLGAIGGLMAAVALL | Similarity with CaCsa1
NCU08131.1 | | 533 | 13 | EGTDICTKTTVDGPKEENG AVSKGAATGVLMAAVLGTASLLL | α-Amylase, CAZy family 17
NCU08227.1 | | 321 | 17 | VAGAAPKLAVGHLCSLA ASTNPPQWLFSSAIIMSATLLYLIS | Endoglucanase, CAZy family 12
NCU08457.1 | Ccg-2 | 108 | 17 | ASLGCVVGVIGSQCGA SVKCCKDDVTNTGNSFLIINAANCVA | Hydrophobin (rodlet protein)
NCU08909.1 | | 542 | 13 | AAATSSSFAAPVAMKSFFTVADMA VGMYVVIAMGVGAGMVML | Gas/Phr family, CAZy family 17
NCU09102.1 | | 604 | 14 | RRMPVGGGLDKVVEVQG GGYEKVPVVKTDNGNIPGVYPLGVY | Peptide amidase
NCU09117.1 | | 375 | 21 | MKPDGKIVPVGAA SAMARPPQGLLVATSLVTVALAAFVGCLA | Crh1 family, CAZy family 16
NCU09155.1 | | 518 | 30 | VGGSGSGSMSATALDS SGRSMAGDVVLSAAVAVGAAVLGSLL | Aspartic protease
NCU09175.1 | | 410 | 19 | EPATVTAPAETAVATGSA TSVKGVSAAAVAGMTLLVGVFAML | Sim. ScBgl2/CaBgl2, CAZy family 17
NCU09672.1 | | 320 | 23 | DAADGNQSKEGIVKKDGSRG VASATNSTSSSSSSSSSSSSSS | Crh1 family, CAZy family 16
NCU09729.1 (d) | | 886 | 21 | GSGSGSAKDENEDEDAG TSLEVVWRVMMLGLSVAGAVGVFGL | Sim. glucosidases, chitinases
NCU09733.1 | | 476 | 26 | LTETEKPAPVSTTSKKNDG AAQEGGAAIAGLVVAIVAAATLL | Sim. ScSUN family
NCU09775.1 | | 342 | 17 | GDNSHGAQGTFYEGVMTTGYPS DAVEDKVQADVVAAGYATSS | α-L-Arabinofuranosidase B

(B) ORFs not giving significant BLAST results (c)
NCU00175.1 | NCU01403.1 | NCU03210.1 | NCU04496.1 | NCU06109.1 | NCU08090.1 | NCU09072.1 | NCU09647.1
NCU00267.1 | NCU01462.1 | NCU03222.1 | NCU04616.1 | NCU06607.1 | NCU08171.1 | NCU09099.1 | NCU09785.1
NCU00270.1 | NCU02041.1 | NCU03293.1 | NCU05096.1 | NCU07050.1 | NCU08193.1 | NCU09133.1 | NCU09851.1
NCU00449.1 | NCU02170.1 | NCU03602.1 | NCU05229.1 | NCU07277.1 | NCU08318.1 | NCU09225.1 | NCU09868.1
NCU00473.1 | NCU02880.1 | NCU03873.1 | NCU05318.1 | NCU07486.1 | NCU08523.1 | NCU09263.1 | NCU09929.1
NCU00881.1 | NCU02884.1 | NCU03895.1 | NCU05395.1 | NCU07776.1 | NCU08680.1 | NCU09525.1
NCU01351.1 | NCU02900.1 | NCU04493.1 | NCU05667.1 | NCU07881.1 | NCU09055.1 | NCU09568.1

(a) Putative GPI-signal peptide cleavage sites are indicated with a space. Note that other cleavage sites cannot always be excluded.
(b) CAZy, Carbohydrate-active enzyme classification according to Coutinho and Henrissat (1999).
(c) C-termini, size and S/T content of these ORFs can be accessed at http://www.pasteur.fr/recherche/unites/Galar_Fungail.
(d) TM domain prediction is not unambiguous.
Sim, similarity with; Ca, C. albicans; Sc, S. cerevisiae; Sp, Sz. pombe.
(A) ORFs that do and (B) ORFs that do not give significant BLAST results.

Discussion

The use of algorithms to identify GPI proteins has been shown in the past to be a powerful method (Caro et al., 1997; Hamada et al., 1998a; Eisenhaber et al., 2001). Using such algorithms, two independent studies each identified about 50 putative GPI proteins in the yeast S. cerevisiae (Caro et al., 1997; Hamada et al., 1998a). Combining the output of these two studies indicates that approximately 70 GPI proteins are present in S. cerevisiae, as has already been suggested (Klis et al., 2002). With the algorithm used in the present study, we selected 66 candidate GPI proteins, including almost all known GPI proteins. The fact that we missed Gas5 and some unknown ORFs, among which are Yor214c and Ypl130w, in our screen indicates that an uncharged hydrophobic tail and the absence of glutamine residue(s), as demanded by our algorithm, may be slightly too strict. Alternatively, sequencing errors may be involved, as has been shown for Ecm33 (Hamada et al., 1999). Gas5 belongs to a family of GPI proteins that are involved in β-1,3-glucan remodelling but has itself not yet directly been shown to be GPI-anchored. For Yor214c and Ypl130w, incorporation into cell walls was observed when the C-terminal 40 amino acids were fused to reporter constructs (Hamada et al., 1999). These three ORFs have one charged residue in the hydrophobic tail. Thus, an even more complete set of proteins could possibly be obtained if an algorithm could be defined that allows one mismatch in this tail; however, this might also cause more false positives to be selected by our screen.
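The relaxation suggested here, tolerating one mismatch in the hydrophobic tail, could be prototyped as follows. This is a sketch under stated assumptions rather than the algorithm used in this study: the residue classes are taken from the consensus in the Abstract, the function names and the max_mismatches parameter are hypothetical, and the three test sequences are invented for illustration (they are not the C-termini of Gas5, Yor214c or Ypl130w).

```python
import re

TAIL_ALLOWED = set("FILMVAGPSTCYWN")  # tail residue class from the consensus
# Attachment-site triplet plus spacer, ending where the tail begins
UPSTREAM = re.compile(r"[NSGDAC][GASVIETKDLF][GASV].{4,19}\Z")

def tail_mismatches(protein: str, tail_len: int = 10) -> int:
    """Count tail residues outside the allowed class (hypothetical helper)."""
    return sum(aa not in TAIL_ALLOWED for aa in protein[-tail_len:])

def matches_gpi(protein: str, max_mismatches: int = 0, tail_len: int = 10) -> bool:
    """Consensus match that tolerates up to max_mismatches in the tail."""
    protein = protein.replace(" ", "").upper()
    if len(protein) < tail_len + 7:  # triplet (3) + shortest spacer (4) + tail
        return False
    upstream_ok = UPSTREAM.search(protein[:-tail_len]) is not None
    return upstream_ok and tail_mismatches(protein, tail_len) <= max_mismatches

# Invented sequences: identical except for charged residues in the tail
strict_ok = "TTSSSGAASNSSMLLAVLGALFL"
one_charged = "TTSSSGAASNSSMLLAVLGAKFL"
two_charged = "TTSSSGAASNSSMLLAVKGAKFL"

for seq in (strict_ok, one_charged, two_charged):
    print(seq, matches_gpi(seq), matches_gpi(seq, max_mismatches=1))
```

Running both settings over a whole proteome and comparing the two candidate lists would give a direct estimate of how many extra ORFs, true or false positives, the relaxed rule admits.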

After optimizing the algorithm in such a way that all known C. albicans GPI proteins were recognized, we identified a set of 104 putative GPI proteins. This indicates that C. albicans has a larger number of different GPI proteins than S. cerevisiae. This may at least partly be explained by the presence in C. albicans of proteins and protein families that are specifically expressed during filamentous growth (e.g. Hwp1, Hyr-related proteins, Als3 and Als8) and/or determine surface hydrophobicity (Csa1) or function as adhesins (Als proteins, Hwp1; reviewed in Sundstrom, 2002). In a study aimed at identification of putative GPI cell wall proteins, 54 ORFs were selected from a set of 152 putative GPI proteins by analysing the region immediately upstream of the proposed GPI attachment site for sequence features that are characteristic of cell wall or plasma membrane proteins (Sundstrom, 2002). All the proteins in that study for which BLAST results indicated cell wall association, except for Crh12, were also recognized by our algorithm. Crh12 was not recognized because of the presence of a glutamic acid residue in the hydrophobic tail, but it otherwise shows normal GPI protein characteristics. Crh12 and its homologues Crh11 and Utr2 are structurally related to the Crh protein family in S. cerevisiae, which has been suggested to be involved in β-1,3-glucan remodelling or incorporation of chitin into the β-glucan network of the cell wall (Rodriguez-Pena et al., 2000). In total, 16 ORFs from the Sundstrom (2002) dataset were not found in our screen. Six of those represent second alleles of other listed GPI proteins: Sundstrom (2002) performed the screen on Stanford genome assembly 6, which comprises 9168 ORFs, whereas we used a non-redundant genome file comprising 6726 ORFs that was created by independent annotations by the Galar Fungail consortium based on the more recent Stanford genome assembly 19. Five other ORFs were not recognized by our GPI algorithm, and the remaining five ORFs were discarded from our list because of other non-GPI-like properties, such as the absence of a clear signal peptide for secretion.

The GPI algorithm, which was primarily designed to identify GPI proteins of S. cerevisiae and C. albicans, also seems to be quite effective and selective when applied to other fungi. First, the C-terminal regions of all currently known GPI proteins from other fungal species match the GPI algorithm. Second, BLAST results indicated that 58% of the Sz. pombe and 44% of the N. crassa proteins that we have identified are homologous to known or putative GPI proteins. Third, the number of proteins we identified in the different species is comparable, although in Sz. pombe, consistent with the low amount of galactomannan in its walls (Sietsma and Wessels, 1990), we found only 33 putative GPI proteins. When the GPI algorithm was further tested on the rice blast fungus Magnaporthe grisea, we obtained a set of putative GPI proteins similar to that of the closely related N. crassa. However, the genome file for M. grisea, which can currently be obtained from the Whitehead Institute, represents an early genome assembly and therefore it is not yet possible to provide detailed information.
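The two percentages follow directly from the counts given in the Results (19 of 33 Sz. pombe candidates and 43 of 97 N. crassa candidates with BLAST homology to known or putative GPI proteins). A minimal check, in which the function name percent_with_homology is a hypothetical helper:

```python
# (candidates with BLAST homology, total candidates), as reported in the Results
counts = {"Sz. pombe": (19, 33), "N. crassa": (43, 97)}

def percent_with_homology(hits: int, total: int) -> int:
    """Share of candidates with homology, rounded to a whole percentage."""
    return round(100 * hits / total)

for species, (hits, total) in counts.items():
    print(species, percent_with_homology(hits, total))
```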

In order to predict which GPI proteins remain attached to the plasma membrane and which proteins are covalently attached to β-1,6-glucan in the cell wall, we have analysed the sequence requirements for the ω region for efficient incorporation of proteins into the cell wall, as proposed by earlier studies on S. cerevisiae (Hamada et al., 1998b, 1999; Vossen et al., 1997). First, we have analysed in known S. cerevisiae cell wall proteins the occurrence of V, I or L at positions 4 or 5 amino acids upstream of the ω site and of Y or N two amino acids upstream of the predicted ω sites, which has been shown to be positively correlated with cell wall incorporation (Hamada et al., 1999). In the cell wall proteins Cwp1, Cwp2 and Ssr1, none of these amino acids is present at these positions (Table 2; Caro et al., 1997), which indicates that these features should not be applied too strictly as a direct requirement for cell wall localization. Second, the presence of basic residues in the ω region is often found in plasma membrane proteins but not in cell wall proteins (Vossen et al., 1997). Indeed, we also find that positively charged side-chains are rare in the ω region of the known cell wall-associated proteins of S. cerevisiae and C. albicans. However, experimental evidence supporting this observation is lacking, which makes it risky to further classify the newly identified GPI proteins solely on the basis of this sequence feature. Also, recent immunological studies showed that part of the plasma membrane protein Gas1 is associated with the cell wall network, which indicates that a default localization of GPI proteins may not exist (Meyer et al., 2002). Furthermore, analysis of ω sequence characteristics is hampered by the fact that GPI proteins often seem to have multiple potential sites for GPI attachment. Firm conclusions on sequence requirements for cell wall and plasma membrane localization of GPI proteins await a more detailed analysis of GPI attachment sites and upstream regions and the identification of GPI proteins isolated from cell walls or the plasma membrane.
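The positional features discussed above can be tabulated for any C-terminus in which the putative cleavage site is marked with a space, as in Tables 4 and 5. The sketch below is only an illustration and not a classifier: the function name omega_features is hypothetical, the ω site is taken to be the residue immediately before the space, basic residues are taken to be K and R, and the 10-residue window upstream of ω is an arbitrary choice. The two example sequences are Sz. pombe entries from Table 4, whereas the rules themselves were derived from S. cerevisiae proteins.

```python
def omega_features(cterm_with_space: str, window: int = 10) -> dict:
    """Report omega-region features for a C-terminus whose putative
    cleavage site is marked by a single space (hypothetical helper)."""
    upstream, _downstream = cterm_with_space.upper().split(" ", 1)
    if len(upstream) < 6:
        raise ValueError("need at least 6 residues up to and including omega")
    return {
        "omega": upstream[-1],
        # V, I or L at 4 or 5 residues upstream of omega
        "VIL_at_minus4_or_5": upstream[-5] in "VIL" or upstream[-6] in "VIL",
        # Y or N at 2 residues upstream of omega
        "YN_at_minus2": upstream[-3] in "YN",
        # K/R count in the window just upstream of omega (assumed definition)
        "basic_upstream": sum(aa in "KR" for aa in upstream[-(window + 1):-1]),
    }

# C-terminal 42 residues as listed in Table 4
lysophospholipase = omega_features("PVDDDSKNPTYNPAVKTSS ASGVHANILLSFFVLLATLLVTA")
psu1 = omega_features("KIVGADGASVSGTCVFENGAFQN GGSGCTVGITSGSGVFVFY")

print(lysophospholipase)
print(psu1)
```

Because many GPI proteins have several potential attachment sites, such a table would need to be computed for each candidate ω position before any localization is inferred from it.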

Acknowledgements

We thank all members of the EU-funded Framework V program ‘Galar Fungail’ for valuable discussions and their contribution to gene annotations. We are indebted to Christophe d'Enfert of the Institut Pasteur (Paris) for creating the CandidaDB database and for help with initial GPI motif searches. CandidaDB was created using sequence data for Candida albicans obtained from the Stanford Genome Technology Center (Stanford, CA) which also created SGD. Sequencing of C. albicans was accomplished with the support of the NIDR and the Burroughs Wellcome Fund. The Neurospora genome sequence was obtained from Whitehead Institute/MIT Center for Genome Research and annotated genome sequences of Sz. pombe were retrieved from the Sanger Institute (Hinxton, UK). PdG was supported by the European Commission (QLRT-1999-30795).
