N-terminal sequence of proteins secreted by the type III secretion apparatus
To identify proteins secreted by the Mxi–Spa secretion apparatus, we determined the N-terminal sequence of proteins secreted by the ΔipaBCDA strain SF635, a derivative of the wild-type S. flexneri strain M90T (serotype 5) that secretes constitutively (Parsot et al., 1995). Secreted proteins were separated by SDS–PAGE, and proteins ranging in size from 10 to 65 kDa were used for N-terminal sequence determination. Some species contained two proteins, the N-terminal sequence of which could nevertheless be identified unambiguously on the basis of the relative intensities of the signals obtained at each sequencing cycle. Thus, the N-terminal sequence of 14 proteins was determined (Table 1). Two of these sequences corresponded to the N-terminal sequences of VirA and IpaH9.8 (Uchiya et al., 1995; Demers et al., 1998). In addition, these data indicated that four proteins encoded by the entry region, MxiC, MxiL, Spa32 and IpgB1, were secreted (see below). The genes encoding the other secreted proteins were then identified by determining the complete sequence of the virulence plasmid pWR100 from the wild-type strain M90T.
Table 1. N-terminal sequence of proteins secreted by the ΔipaBCDA mutant SF635.
|Size (kDa)||N-terminal sequence||Protein|
Sequencing of pWR100
To determine the sequence of pWR100, we used both a library of 10 cosmids carrying overlapping inserts of 40 kb (Maurelli et al., 1985) and recombinant plasmids carrying most of the BamHI and SalI fragments subcloned from pWR100. The final sequence assembly of the virulence plasmid was confirmed by comparing the restriction patterns predicted from the sequence with those obtained by digestion of pWR100 with different enzymes. No major differences were detected between the sequence determined here and previously for genes identified on pWR100, pMYSH6000 (the virulence plasmid from a S. flexneri strain of serotype 2a; Sasakawa et al., 1986) or the virulence plasmids from other S. flexneri strains (Sasakawa et al., 1992; Parsot and Sansonetti, 1999).
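The assembly check described above, comparing restriction patterns predicted from the sequence with those observed after digestion, can be illustrated with a minimal in silico digest of a circular molecule. This is only a sketch under stated assumptions: the function names and the toy sequence are hypothetical, and only the BamHI (G^GATCC) and SalI (G^TCGAC) recognition sites are modelled; it is not the procedure used for pWR100.

```python
import re

# Recognition site and offset of the top-strand cut within the site.
# Both sites are palindromic, so searching one strand is sufficient.
ENZYMES = {
    "BamHI": ("GGATCC", 1),  # G^GATCC
    "SalI": ("GTCGAC", 1),   # G^TCGAC
}


def cut_positions(seq, site, offset):
    """Return sorted cut coordinates for one enzyme on a circular sequence."""
    n = len(seq)
    # Append the first len(site) - 1 bases so that sites spanning the origin are found.
    wrapped = seq + seq[: len(site) - 1]
    return sorted({(m.start() + offset) % n
                   for m in re.finditer(f"(?={site})", wrapped)})


def circular_digest(seq, enzyme_names):
    """Predict fragment sizes (bp), largest first, for a circular molecule."""
    n = len(seq)
    cuts = sorted({p for name in enzyme_names
                   for p in cut_positions(seq, *ENZYMES[name])})
    if not cuts:
        return [n]  # uncut circle
    # Distance from each cut to the next one, wrapping around the origin;
    # a single cut linearizes the circle and yields one fragment of length n.
    sizes = [(cuts[(i + 1) % len(cuts)] - c) % n or n
             for i, c in enumerate(cuts)]
    return sorted(sizes, reverse=True)


# Hypothetical 32 bp toy "plasmid" with two BamHI sites and one SalI site.
toy = "AAGGATCCTTTTGTCGACAAAAAAGGATCCTT"
print(circular_digest(toy, ["BamHI"]))
print(circular_digest(toy, ["BamHI", "SalI"]))
```

Predicted fragment sizes of this kind would then be compared with the band sizes observed on a gel for each enzyme.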
pWR100 is composed of 213 494 bp (Fig. 1). Sequence analysis identified 93 DNA fragments that have over 90% identity with known insertion sequence (IS) elements and constitute ≈ 58 kb of pWR100 (Table 2). Several DNA fragments were tentatively identified as parts of seven putative new IS elements, designated ISSfl1–ISSfl7 (Mahillon and Chandler, 1998). These new ISs correspond to 29 fragments that account for 19 kb of sequence. Accordingly, one-third of the plasmid appears to correspond to IS elements. Outside these ISs, we identified ≈ 100 open reading frames (ORFs) that correspond to potential genes and represent 100 kb of sequence. These ORFs were labelled according to the co-ordinates (in kb) of their position on the sequence, and we used the current nomenclature to designate most of the ORFs that correspond to genes identified previously (Sasakawa et al., 1992). Sequence analysis indicated that pWR100 carries several multigene families encoding related proteins. For the sake of clarity, genes belonging to the same family were designated using the same generic name, e.g. ospD, followed by a different number, e.g. ospD1, ospD2 and ospD3. Accordingly, the previously characterized senA gene that was proposed to encode an enterotoxin (Nataro et al., 1995) was designated ospD3.
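The proportions quoted above can be checked with a short calculation. The sketch below uses the rounded sizes given in the text (approximately 58 kb, 19 kb and 100 kb), so the results are approximate; the variable names are illustrative only.

```python
PLASMID_BP = 213_494   # total size of pWR100 given in the text
KNOWN_IS_BP = 58_000   # approx. total of fragments matching known IS elements
NEW_IS_BP = 19_000     # approx. total of fragments assigned to ISSfl1-ISSfl7
ORF_BP = 100_000       # approx. total of ORFs located outside IS elements

is_fraction = (KNOWN_IS_BP + NEW_IS_BP) / PLASMID_BP
orf_fraction = ORF_BP / PLASMID_BP

print(f"IS elements: {is_fraction:.1%}")                # about 36%, i.e. roughly one-third
print(f"ORFs outside IS elements: {orf_fraction:.1%}")  # a little under one-half
```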
Figure 1. Genetic map of pWR100. The position and direction of transcription of the various genes and ORFs are indicated by arrows. ORFs for which no function could be proposed are labelled according to their co-ordinates (in kb) on the sequence. Genes truncated or inactivated by frameshifts are indicated in parentheses. The colours refer to the G+C content of each ORF: red, < 40%; blue, between 40% and 50%; green, > 50%. The position of ISs is indicated by yellow bars. The sequence of pWR100 has been submitted to the DDBJ/EMBL/GenBank databases under accession number AL391753.
Table 2. IS elements detected on pWR100.
|IS (name)||Length (bp)||GC (%)||Fragments (number)||Total size (bp)||Complete IS|
The entry region
Genetic analysis has shown that a 31 kb fragment of the virulence plasmid is necessary and sufficient for entry of bacteria into epithelial cells (Maurelli et al., 1985; Sasakawa et al., 1988; 1993). This fragment contains 34 genes that are clustered in two regions transcribed in opposite directions. The first region contains 10 genes, from icsB to virB, and the second region contains 24 genes, from ipgD to spa40 (Fig. 1). Many of these genes have already been studied (see References in the Introduction and Hueck, 1998 for a review) and will not be described here.
N-terminal sequencing of proteins secreted by SF635 indicated that, in addition to IpaA–D and IpgD, several other proteins encoded by the entry region are secreted: Spa32, MxiC, MxiL and IpgB1. Both Spa32 and MxiC exhibit sequence similarities to components of the type III secretion apparatus of Yersinia and Salmonella spp., and Spa32 has been shown to be involved in the activity of the Mxi–Spa secretion apparatus (Watarai et al., 1995b). No protein homologous to MxiL was detected in protein sequence databases; however, the position of the mxiL gene within the mxi operon suggests that MxiL is involved in the secretion apparatus. Inactivation of ipgB1 had no effect on secretion of the IpaA–D proteins (R. Ménard, P. Sansonetti and C. Parsot, unpublished results), which suggests that IpgB1 is not a component of the secretion apparatus. IpgB1 exhibits 20% sequence identity with (i) TrcA (Tobe et al., 1999a) and LEE19 (Elliott et al., 1998), which are encoded by the chromosomal LIM and LEE loci, respectively, of enteropathogenic E. coli (EPEC) strains; (ii) TrcP, which is encoded by the adherence factor plasmid of EPEC (Tobe et al., 1999b); and (iii) IpgB2, which is also encoded by pWR100. The most closely related proteins are IpgB2 and TrcA, which exhibit 37% sequence identity. TrcA has been proposed to be a cytoplasmic chaperone required for the production of BfpA and intimin (Tobe et al., 1999b). Genes encoding proteins related to BfpA and intimin are not present on pWR100, and the observation that IpgB1 is secreted by the type III secretion apparatus of S. flexneri suggests that related proteins are also secreted.
Genes located between virB and spa40 all have a low G+C content (average 34.2%). On the left side, the fragment located between virB and the closest IS element (IS100) is 3 kb long, has a G+C content of 38.7% and contains one ORF, ipaJ (Buysse et al., 1997). On the right side, the fragment located between spa40 and IS600 is 0.9 kb long, has a G+C content of 33.2% and contains two ORFs, orf131a and orf131b. This suggests that these genes and their flanking regions have the same origin as the ipa, ipg, mxi and spa genes. The IS elements located on both sides of the entry region are truncated, and each lacks the extremity directed towards the entry region, which indicates that rearrangements, including deletion of one extremity of each IS and its flanking sequences, had taken place after acquisition of the entry region.
The osp genes
The genes encoding all the proteins secreted by SF635 (Table 1) were identified on pWR100. These include virA (Uchiya et al., 1995), ipaH4.5, ipaH7.8, ipaH9.8 (Hartman et al., 1990; Venkatesan et al., 1991; Demers et al., 1998; see below) and six new genes that were designated osp (outer Shigella proteins): ospB, ospC1, ospD1, ospE1, ospF and ospG. In addition, genes encoding proteins with sequence similarities to Osp proteins were also detected on pWR100. These sequence similarities suggested that the products of these genes might also be secreted, and these genes were designated ospC2, ospC3, ospC4, ospD2, ospD3 and ospE2. The N-terminal sequences of OspE1 and OspE2 are identical, and the 10 kDa protein secreted by SF635 could correspond to OspE1, OspE2 or a mixture of both. The sizes of OspD2, OspD3, OspC2 and OspC3 are similar to those of IpaH proteins (about 60 kDa), which might explain why these proteins were not resolved from IpaH proteins by SDS–PAGE. Alternatively, the conditions of expression of these proteins might be different from those used to grow SF635. The nucleotide sequences of ospE1 and ospE2 and those of ipaH1.4 and ipaH2.5, which are located downstream from ospE1 and ospE2, respectively, are almost identical, indicating that these regions result from a duplication event. This duplication was followed by the insertion of IS elements upstream from ospE1, within the ospE2–ipaH2.5 intergenic region, and downstream from ipaH2.5. The sequences of ospC2, ospC3 and ospC4 exhibit 96% identity to each other. However, two deletions within ospC4, at positions 98 and 560, result in frameshifts that inactivate this gene. The sequence of ospC1 exhibits only 74% identity with those of the other ospC genes, suggesting a more ancient duplication event. Members of the ospD family are more distantly related compared with those of the ospE and ospC families. 
OspD2 (569 residues) and OspD3 (SenA; 565 residues) exhibit 38% sequence identity over their entire length, and the C-terminal region of both proteins contains six repeats of 44 residues (Fig. 2). A similar repeat is present three times in the C-terminal region of OspD1 (225 residues).
Figure 2. The repeated motifs of OspD proteins. The sequences of the C-terminal domain of OspD1 (225 residues), OspD2 (569 residues) and OspD3 (565 residues) have been aligned to show the 44-residue repeated motif. Residues that are identical in at least six repeats are underlined and indicated in the consensus sequence. Dots indicate gaps that have been introduced to maximize the alignment.
No proteins sharing sequence similarity with Osp proteins were detected in protein sequence databases, except for OspF, which exhibits 63% sequence identity with SpvC, a protein encoded by the virulence plasmid of Salmonella typhimurium. The spvC gene is part of a locus comprising five genes that are required for S. typhimurium to cause systemic infections of the reticuloendothelial organs during experimental infections of animals (Gulig et al., 1993). The observation that OspF is secreted by S. flexneri suggests that SpvC might also be secreted.
The G+C content of osp genes ranges from 31.2% (ospC1) to 37.9% (ospF), which suggests that these genes have the same origin as the entry region (34.2% G+C). Indeed, G+C content appears to be a discriminating criterion for classifying the genes carried by pWR100. Of the 109 genes and ORFs detected on pWR100, 56 have a G+C content lower than that of spa47 (39.3%), which is the gene of the entry region that has the highest G+C content. In addition to genes of the entry region, these include all the osp genes and only a few other genes: virF (30.4%), orf137 (31.0%), orf13 (31.6%), orf81 (34.1%), orf212 (34.5%) and icsP (sopA) (37.6%). Both virF and orf81 encode proteins with similarities to transcriptional activators of the AraC family. VirF is required for the expression of virB and therefore is functionally related to the entry region. No convincing similarities were detected between the sequences of Orf13, Orf137 or Orf212 and proteins present in databases. The icsP gene encodes an outer membrane protease that is involved in cleavage of IcsA and does not appear to be related to genes of the entry region (Egile et al., 1997; Shere et al., 1997). Whether the products of the ORFs other than icsP that have a low G+C content have any functional relationship with the Mxi–Spa secretion machinery or Osp proteins will require further investigation.
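The G+C criterion used above can be expressed as a small calculation. The following is a minimal sketch, assuming plain nucleotide strings as input; the function names are hypothetical, the 39.3% cut-off is the spa47 value quoted in the text, and the colour classes are those of the Figure 1 legend.

```python
SPA47_GC = 39.3  # G+C content (%) of spa47, the highest value in the entry region


def gc_content(seq):
    """Return the G+C content (%) of a nucleotide sequence, ignoring non-ACGT symbols."""
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * sum(base in "GC" for base in counted) / len(counted)


def is_entry_like(seq, threshold=SPA47_GC):
    """True if the G+C content is lower than that of spa47."""
    return gc_content(seq) < threshold


def figure1_colour_class(gc_percent):
    """Map a G+C percentage to the colour classes of the Figure 1 legend."""
    if gc_percent < 40:
        return "red"
    if gc_percent <= 50:
        return "blue"
    return "green"


# Values quoted in the text: virF (30.4%), icsA (41.0%), tra region (57.1%).
for name, gc in [("virF", 30.4), ("icsA", 41.0), ("tra region", 57.1)]:
    print(name, figure1_colour_class(gc))
```

A single threshold is a coarse criterion; as noted for icsP, a low G+C content does not by itself establish a functional relationship with the entry region.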
The ipaH genes
Previous analysis has shown that pWR100 carries five ipaH genes, which were designated ipaH1.4, ipaH2.5, ipaH4.5, ipaH7.8 and ipaH9.8 according to the size of the HindIII fragment of the virulence plasmid that carries each gene. The present analysis indicates that ipaH1.4 and ipaH2.5 are not incomplete, as was proposed (Venkatesan et al., 1991), and corrects the reading frame of the beginning of ipaH7.8 (Hartman et al., 1990). This correction is consistent with the N-terminal sequence determined for a protein of the size of IpaH7.8 that is secreted by SF635. Except for the presence of an IS629 element at the 3′ end of ipaH2.5, the entire sequences of ipaH1.4 and ipaH2.5 are almost identical and, for the sake of clarity, the ipaH2.5 gene will not be discussed further.
Members of the ipaH family are characterized by (i) a 5′ variable region of 600–760 bp that encodes six to eight repeats of a 20-residue motif; and (ii) a 3′ constant region of 839 bp that is identical in all genes (Venkatesan et al., 1991). In fact, the region of identity between ipaH9.8, ipaH7.8 and ipaH4.5 is extended by 98 bp at the 5′ end of the constant region, and the region of identity between ipaH7.8 and ipaH4.5 is extended by 29 bp further upstream (Fig. 3). Downstream from the 3′ end of the constant region, defined by the site of insertion of an IS629 element in ipaH4.5 (and ipaH2.5), the sequences of ipaH7.8, ipaH9.8 and ipaH1.4 are identical over 55 bp, except for a small deletion within ipaH7.8. In contrast, there is little sequence similarity among the 5′ parts of ipaH genes, even though the deduced amino acid sequences are related. The difference in sequence conservation of the 5′ and 3′ parts of ipaH genes cannot be explained in a classical scheme of divergent evolution after duplication of an ancestral copy, even by hypothesizing that functional constraints on the C-terminal domain of IpaH proteins might be more important than those on the N-terminal domain. Moreover, the G+C contents of the variable and constant regions of ipaH genes are clearly different; the G+C content of the variable regions ranges from 35.2% for ipaH7.8 to 39.3% for ipaH4.5 and that of the constant region is 53.9%. These observations indicate that the two parts of ipaH genes have different origins and suggest that the constant region of ipaH genes might result from independent conversion events on pre-existing copies of ipaH ‘genes’ by an unknown mechanism.
Figure 3. Schematic representation of the constant region of ipaH genes. The sequences of various fragments of each ipaH gene are shown, separated by dashes. Numbers on the first line refer to the co-ordinates of the nucleotides (indicated by stars) within the coding sequence of ipaH7.8. Regions that are identical between several ipaH genes are boxed, and their length is indicated on the last line. For the sake of clarity, the sequence of ipaH2.5, which is identical to that of ipaH1.4, is not shown.
The icsA and virK regions
The icsA and virK genes are not linked on the plasmid; however, they are both present within clusters of genes that have a G+C content of approximately 42%. The first cluster includes four genes, icsA (41.0% G+C), orf149 (41.7% G+C), ushA (40.5% G+C) and phoN1 (42.4% G+C), which are separated by ISs (Fig. 1). Part of the sequence deduced from orf149 exhibits 82% identity with an internal fragment of PapC (between residues 268 and 349), the outer membrane usher protein of Pap pili from E. coli. The ushA gene encodes a protein that exhibits 76% identity with the UshA protein of E. coli and S. typhimurium, a periplasmic UDP-sugar hydrolase (Burns and Beacham, 1986; Edwards et al., 1993). The phoN1 (phoNsf) gene encodes a periplasmic acid phosphatase (Uchiya et al., 1996) that exhibits 50% sequence identity with the product of phoN2 (apy), another periplasmic phosphatase encoded by the virulence plasmid (Bhargava et al., 1995). The second cluster of genes includes orf185, orf186, virK and msbB2. The orf185–orf186, orf186–virK and virK–msbB2 intergenic regions are 2, 4 and 63 bp long, respectively, suggesting that these genes and ORFs might be part of the same operon. The products of orf185, orf186 and virK exhibit 90% sequence identity with the products of three adjacent ORFs carried by pAA2, a plasmid harboured by enteroaggregative E. coli (Czeczulin et al., 1999). Likewise, the products of orf185, orf186 and msbB2 exhibit 65% sequence identity with the products of ORFs carried by pO157, the virulence plasmid of an enterohaemorrhagic E. coli (EHEC) strain (Burland et al., 1998; Makino et al., 1998). Genes encoding proteins related to Orf185, Orf186, VirK and MsbB2 are also present on the chromosome of various bacteria, including E. coli, S. typhimurium and Neisseria meningitidis, indicating that, although widespread among virulence plasmids, these genes are not specific to plasmids. The MsbB2 protein exhibits 70% sequence identity with the chromosomally encoded MsbB protein of E. coli, an acyltransferase that is involved in lipid A modification (Clementz et al., 1997).
The similar G+C content of icsA, orf149, ushA and phoN1 and their genetic linkage suggest that these genes might have the same origin, and the presence of a remnant of the papC gene (orf149) suggests that, in the past, this region might have carried a pap operon. The function of VirK is not yet known; however, a functional relationship between VirK and IcsA is suggested by the phenotype of the virK mutant (Nakata et al., 1992). In that respect, it is noteworthy that the icsA and virK regions have a similar G+C content that is different from other parts of the plasmid.
Replication, partition and transfer functions
Hybridization studies indicated that pMYSH6000 is a RepFIIA (IncFII) replicon (Makino et al., 1988; Silva et al., 1988). This system consists of (i) oriR, which is the origin of replication; (ii) RepA, which is required for replication at oriR; (iii) CopB, which represses transcription at the repA promoter; (iv) TapA, which is required for expression of RepA; and (v) CopA, which is an antisense RNA functioning as a copy number control element (Blomberg et al., 1994; Malmgren et al., 1997). Each of these elements was detected on pWR100. RepA, CopB and TapA of pWR100 exhibit 95–100% sequence identity with the corresponding proteins from plasmids R100, a 90 kb conjugative IncFII resistance plasmid from a clinical isolate of S. flexneri (Nakaya et al., 1960), and pO157. As for other IncFII replicons, a DnaA box and a putative origin of replication were detected downstream from repA on pWR100. The putative CopA antisense RNA of pWR100 exhibits over 90% sequence identity with those of plasmids R100 and pO157, which suggests that the copy number of pWR100 might be similar to that of R100, i.e. one or two copies per chromosome.
Partitioning systems are essential for inheritance of low-copy-number plasmids in daughter cells. Sequence analysis revealed the presence of two partitioning systems on pWR100 (Fig. 4). The first system, designated par (bp 29 020–31 196), is similar to the parABS system of bacteriophages P1 and P7 (Ludtke and Austin, 1987; Chattoraj and Schneider, 1997) and plasmid pMT1 of Yersinia pestis (Lindler et al., 1998), whereas the second system, designated stb (bp 158 164–159 515), is similar to the stbAB system of plasmids R100 (NR1) (Miki et al., 1980; Tabuchi et al., 1988) and pB171, the adherence factor plasmid of EPEC (Tobe et al., 1999b). ParA and ParB of pWR100 exhibit 75% and 58% sequence identity to ParA and ParB, respectively, of P1, and the region located downstream from parB on pWR100 contains sequence motifs that are similar to those of the parS site of P1 (Hayes and Austin, 1994). On the other hand, StbA and StbB of pWR100 exhibit 39% and 29% sequence identity with StbA and StbB, respectively, of R100, and a putative cis-acting site exhibiting a strong A+C/T+G bias and several repeats (Tabuchi et al., 1988) is present upstream from stbA on pWR100 (Fig. 4). These similarities suggest that both the par and the stb systems of pWR100 are functional. Another protein encoded by pWR100, VirB, exhibits sequence similarities with members of the ParB family (Adler et al., 1989). However, no gene encoding a protein homologous to ParA is present in the vicinity of virB or elsewhere on the virulence plasmid (except for the parA gene described above), which suggests that VirB is not involved in plasmid partitioning. VirB is required for transcription of genes of the entry region by a mechanism that has not yet been elucidated.
Figure 4. The par and stb regions of pWR100. The position and orientation of the par and stb genes of pWR100, R100 and P1 are shown by arrows, with the number of residues of encoded proteins. The nucleotide sequences of the 3′ part of parS sites (indicated by boxes) are shown, with repeated motifs indicated in bold characters. Upstream from the stbA genes, the cis-acting sites are indicated by boxes (not to scale), and their A+C contents are indicated within boxes. The sequences and numbers of occurrences of repeats detected in the cis-acting sites are shown on the left.
Plasmids often carry a system that is involved in post-segregational killing of bacteria that have not received a copy of the plasmid. pWR100 encodes two proteins that exhibit 85% and 95% sequence identity to CcdA and CcdB, respectively, of plasmid F, which are responsible for killing of segregant cells (Jaffe et al., 1985). In addition, pWR100 carries the mvpT and mvpA genes that are located on the complementary strand of the trbH gene and encode a toxin and an antidote, respectively, which can promote stable inheritance of recombinant plasmids carrying the origin of replication of pMYSH6000 (Radnedge et al., 1997; Sayeed et al., 2000).
Analysis of pWR100 revealed the presence of an 8.4 kb region that exhibits 96% sequence identity with parts of the transfer region of plasmid F. This region is limited by two copies of IS600 and corresponds to the 3′ part of traD, a complete trbH gene, a traI gene interrupted by an internal deletion leading to a frameshift and the traX and finO genes. Other tra genes were not detected on pWR100, which is consistent with previous results suggesting that pWR100 is non-conjugative (Sansonetti et al., 1982).
The sequence of six adjacent ORFs (orf159b, 160, 161a, 161b, 162 and 163) covering a 3.6 kb region exhibits 94% identity with that of ORFs carried by pColIb-P9 and pO157. The presence of these genes on different plasmids and the absence of related genes on bacterial chromosomes suggest that they might be involved in plasmid maintenance or transfer. The G+C content of this region (56.9%) is similar to that of the tra region discussed above (57.1%).
Sequence analysis of pWR100 indicated that one-third of the plasmid is composed of IS elements. These are likely to be responsible for the heterogeneity detected in restriction patterns of the virulence plasmid harboured by strains of different species or of different serotypes within the same species (Sansonetti et al., 1983b). Differences in G+C content among genes carried by pWR100 provide strong evidence that the plasmid is composed of a mosaic of blocks of genes from different origins. Genes of the entry region and osp genes have a similar G+C content (average 34%), which is different from those of the transfer and replication regions (55% G+C), the icsA and virK regions (41% G+C) or sepA (49% G+C). This suggests that the entry region and osp genes have the same origin and were probably acquired simultaneously, even though they are not linked on the plasmid. Insertion of IS elements and DNA rearrangements involving deletions (indicated by the presence of truncated ISs adjacent to the entry region), duplications (indicated by the presence of several almost identical copies of ospC genes) and inversions (suggested by the opposite orientations of the identical ipaH1.4 and ipaH2.5 genes) have modified the organization of the ancestral blocks and have probably led to the loss of some genetic information initially carried by these blocks (as suggested by the absence of the impABC genes on pWR100; Runyen-Janecky et al., 1999). The presence of two seemingly functional partition systems (parAB and stbAB) and of the remnant of a third one (virB) suggests that parts of pWR100 come from three plasmids. Moreover, the G+C content of the replication system (repA, 58%) is different from those of the above-mentioned partition systems, virB (34%), stbAB (39%) and parAB (43%), which suggests that the virulence plasmid contains elements that were initially carried by four plasmids.
Determination of the N-terminal sequence of proteins present in the culture medium of strain SF635 allowed us to identify new proteins secreted by the type III secretion machinery of S. flexneri. These and previous results indicate that 19 proteins are secreted by the Mxi–Spa secretion apparatus: Spa32, MxiC and MxiL, which might be involved in the assembly or regulation of the secretion apparatus, and IpaA–D, IpgB1, IpgD, OspB–G, VirA and three IpaH proteins. Analysis of pWR100 also revealed the presence of eight genes encoding proteins with sequence similarities to secreted proteins, which suggests that the repertoire of proteins secreted by the Mxi–Spa secretion apparatus consists of ≈ 25 proteins. Several secreted proteins are encoded by multigene families (ipaH, ipgB, ospC, ospD and ospE), and it is not known whether or not all members of a family are expressed in the same environmental conditions. No genes encoding potential chaperones were detected in the vicinity of osp and ipaH genes, which suggests that the requirement for specific cytoplasmic chaperones concerns only a subset of proteins that are secreted by the type III secretion pathway (Wattiau et al., 1996). Including components of the secretion apparatus, secreted proteins, chaperones and regulators, the type III secretion system of S. flexneri comprises about 50 proteins. It seems unlikely that all this genetic information is involved solely in entry of bacteria into epithelial cells. According to the current model of the type III secretion pathway (Hueck, 1998), Osp and IpaH proteins are candidates for being translocated into target cells. The type of cells in which these proteins might be translocated and the cellular processes they might affect remain to be investigated.
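The count of 19 secreted proteins given above can be reproduced by a simple tally. The grouping below is only an illustrative sketch: the group labels are ours, the Osp entries follow the six osp genes named in the section on osp genes, with OspE1 standing for the 10 kDa species that could be OspE1, OspE2 or both, and the three IpaH proteins are taken to be IpaH4.5, IpaH7.8 and IpaH9.8.

```python
# Proteins reported as secreted by the Mxi-Spa apparatus, grouped for counting.
secreted = {
    "apparatus-associated": ["Spa32", "MxiC", "MxiL"],
    "Ipa": ["IpaA", "IpaB", "IpaC", "IpaD"],
    "Ipg": ["IpgB1", "IpgD"],
    # OspE1 stands for the 10 kDa species (OspE1, OspE2 or a mixture of both).
    "Osp": ["OspB", "OspC1", "OspD1", "OspE1", "OspF", "OspG"],
    "VirA": ["VirA"],
    "IpaH": ["IpaH4.5", "IpaH7.8", "IpaH9.8"],
}

total_secreted = sum(len(members) for members in secreted.values())
print(total_secreted)
```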