The length of a tetranucleotide repeat tract in Haemophilus influenzae determines the phase variation rate of a gene with homology to type III DNA methyltransferases


  • Xavier De Bolle,

    1. Molecular Infectious Diseases Group, University of Oxford, Department of Paediatrics, Institute of Molecular Medicine, John Radcliffe Hospital, Headington, Oxford OX3 9DU, UK.
    • *Present address: Laboratory of Immunology–Microbiology, URBM, University of Namur, Rue de Bruxelles 61, 5000 Namur, Belgium.

    • †The first two authors contributed equally to this work.

  • Christopher D. Bayliss,

    1. Molecular Infectious Diseases Group, University of Oxford, Department of Paediatrics, Institute of Molecular Medicine, John Radcliffe Hospital, Headington, Oxford OX3 9DU, UK.
    • †The first two authors contributed equally to this work.

  • Dawn Field,

    1. Molecular Infectious Diseases Group, University of Oxford, Department of Paediatrics, Institute of Molecular Medicine, John Radcliffe Hospital, Headington, Oxford OX3 9DU, UK.
  • Tamsin Van De Ven,

    1. Molecular Infectious Diseases Group, University of Oxford, Department of Paediatrics, Institute of Molecular Medicine, John Radcliffe Hospital, Headington, Oxford OX3 9DU, UK.
  • Nigel J. Saunders,

    1. Molecular Infectious Diseases Group, University of Oxford, Department of Paediatrics, Institute of Molecular Medicine, John Radcliffe Hospital, Headington, Oxford OX3 9DU, UK.
  • Derek W. Hood,

    1. Molecular Infectious Diseases Group, University of Oxford, Department of Paediatrics, Institute of Molecular Medicine, John Radcliffe Hospital, Headington, Oxford OX3 9DU, UK.
  • E. Richard Moxon

    1. Molecular Infectious Diseases Group, University of Oxford, Department of Paediatrics, Institute of Molecular Medicine, John Radcliffe Hospital, Headington, Oxford OX3 9DU, UK.

Christopher D. Bayliss. E-mail cbayliss@molbiol.ox.; Tel. (+44) 1865 222347; Fax (+44) 1865 222626.


Haemophilus influenzae is an obligate commensal of the upper respiratory tract of humans that uses simple repeats (microsatellites) to alter gene expression. The mod gene of H. influenzae strain Rd has homology to DNA methyltransferases of type III restriction/modification systems and has 40 tetranucleotide (5′-AGTC) repeats within its open reading frame. This gene was found in 21 out of 23 genetically distinct H. influenzae strains, and in 13 of these strains the locus contained repeats. H. influenzae strains were constructed in which a lacZ reporter was fused to a chromosomal copy of mod downstream of the repeats. Phase variation occurred at a high frequency in strains with the wild-type number of repeats. Mutation rates were derived for similarly engineered strains, containing different numbers of repeats. Rates increased linearly with tract length over the range 17–38 repeat units. The majority of tract alterations were insertions or deletions of one repeat unit with a 2:1 bias towards contractions of the tract. These results demonstrate the number of repeats to be an important determinant of phase variation rate in H. influenzae for a gene containing a microsatellite.


Haemophilus influenzae is a common bacterial commensal of humans that colonizes the upper respiratory tract and has not been isolated from other species or inanimate environmental niches. H. influenzae is potentially pathogenic in humans and causes a number of diseases, including otitis media, septicaemia and meningitis. The capacity to phase vary a number of surface-expressed molecules is a critical factor in determining the pathogenic potential of H. influenzae.

Phase variation is the reversible, high-frequency gain and loss of a phenotype, resulting from changes in expression of a single or of multiple genes. The most common molecular mechanism of phase variation in H. influenzae involves mutations in simple DNA repeats or microsatellites located within the reading frames or promoters of genes encoding variant proteins (Weiser et al., 1989; van Ham et al., 1993). In the simplest case, loss or gain of a tetranucleotide repeat within an open reading frame (ORF) results in a frameshift, so that the protein is either translated or it is not. For example, alterations in the tetranucleotide repeat tract within lic1, a gene required for the addition of phosphorylcholine to the core sugars of H. influenzae lipopolysaccharide (LPS), result in phase-variable switching between a phenotype adapted to promote colonization of the respiratory tract and one conferring resistance to C-reactive protein-dependent complement-mediated lysis (Weiser and Pan, 1998). This and other examples (Weiser et al., 1990; van Ham et al., 1993; Hood et al., 1996) provide compelling evidence that the presence and number of repeats in phase-variable genes are determinants of commensal and pathogenic behaviour in H. influenzae, and the evidence is persuasive that this is also true for other bacteria that contain microsatellites within genes (Moxon et al., 1994; Dehio et al., 1998).

The sequencing of complete bacterial genomes has simplified the identification and investigation of loci containing simple repeats or microsatellites (Hood et al., 1996; Saunders et al., 1998). In the case of H. influenzae, a search of the complete genome sequence (Hood et al., 1996) identified 12 loci with multiple (> 5) tetranucleotide repeats; 11 of these loci contained more than 15 repeat units and the number of units ranged from 6 to 36. For some of these loci, the numbers of repeat units also differ widely between strains; thus, in 27 typeable and non-typeable H. influenzae strains, the number of repeats in lic1 ranged from 5 to 57 (High et al., 1996). Although it is assumed that these differences have biological significance, such as altered mutation rates, this has not been shown experimentally.
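The kind of genome scan described above can be illustrated with a short Python sketch. This is not the search procedure of Hood et al. (1996), whose details are not given here; it is a minimal regular-expression scan, assuming the genome is available as a plain string of A, C, G and T. The function name `find_tetranucleotide_tracts` and the toy sequence are hypothetical.

```python
import re

# A 4-nt unit followed by at least five more copies of itself,
# i.e. tracts with more than 5 repeat units (the threshold quoted in the text).
TETRA_TRACT = re.compile(r"([ACGT]{4})\1{5,}")


def find_tetranucleotide_tracts(sequence):
    """Return (start, unit, n_units) for each tandem tetranucleotide tract found."""
    tracts = []
    for match in TETRA_TRACT.finditer(sequence.upper()):
        unit = match.group(1)
        if unit[:2] == unit[2:]:
            # Units such as AAAA or ACAC are really mono- or dinucleotide repeats.
            continue
        tracts.append((match.start(), unit, len(match.group(0)) // 4))
    return tracts


if __name__ == "__main__":
    # Toy sequence (hypothetical): eight AGTC units flanked by unrelated bases.
    toy = "TTT" + "AGTC" * 8 + "GGG"
    print(find_tetranucleotide_tracts(toy))
```

Only one strand is scanned; a tract on the opposite strand appears in the same scan as a tract of the reverse-complement unit.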

In contrast to bacterial genomes, microsatellites are abundant in eukaryotic genomes. The molecular mechanisms that change the number of repeat units within eukaryotic microsatellites have been extensively studied, using both yeast and Escherichia coli as model systems. Many cis- and trans-acting factors have been found to have an influence on the rates of variation (see Wierdl et al., 1997 and references therein). The main consensus of this work is that slip strand mispairing (Streisinger et al., 1966; Levinson and Gutman, 1987a) is largely responsible for changes in the numbers of repeats. One major cis-acting factor has been investigated, the length of the repeat tract. Studies in yeast have shown that mutation rates increase with tract length for mononucleotide (Tran et al., 1997) and trinucleotide (Freudenreich et al., 1998) tracts integrated into the chromosome and for dinucleotide (Wierdl et al., 1997) tracts in plasmids. In E. coli, most studies have used plasmid or bacteriophage constructs. In these systems, mutation rate also increases with tract length for mono-, di- and trinucleotide tracts (Streisinger and Owen, 1985; Levinson and Gutman, 1987b; Freund et al., 1989; Bichara et al., 1995; Kang et al., 1995; Strauss et al., 1997; Sarkar et al., 1998). Morel et al. (1998), however, used E. coli chromosomal constructs to show that mutation frequency increases with tract length for a dinucleotide repeat. No comprehensive studies of this type have been reported for other bacterial systems and, in particular, there have been no reports in bacteria that use simple repeats to drive phase-variable gene expression.

E. coli has been used in many studies of eukaryotic microsatellites, even though microsatellites are under-represented in its genome (Karlin et al., 1997). H. influenzae and a number of other bacterial species differ from E. coli in having an abundance of certain types of microsatellites with long tract lengths (Hood et al., 1996; Karlin et al., 1997; Saunders et al., 1998), tetranucleotides in the case of H. influenzae. For H. influenzae, Neisseria meningitidis, Neisseria gonorrhoeae and some other bacteria, these long microsatellites are functional, i.e. they drive phase-variable gene expression (see above and Hammerschmidt et al., 1996; Dehio et al., 1998). No functional microsatellites have been described in E. coli, a possible indication that the mechanisms causing or repairing mutations in microsatellites differ between E. coli and the phase-variable bacteria. It was necessary, therefore, to measure the mutation rates of the H. influenzae microsatellites directly, rather than using model constructs in E. coli or basing estimates on previous results of other systems.

In H. influenzae, all of the tetranucleotide repeats are located within open reading frames and, in general, are within genes known or predicted to encode either outer membrane proteins or proteins that synthesise an outer membrane component (Hood et al., 1996); this includes four genes involved in LPS biosynthesis and four in iron uptake. For these genes, phase-variable expression might be expected as their products are likely to be, directly or indirectly, targets for surveillance by the immune system. A less intuitive finding was of tetranucleotide repeats in a gene with homology to the methylation subunit of a restriction/modification (R/M) system (Hood et al., 1996). Although it is likely that each of these genes phase varies at particular rates (the result of the force of natural selection on a particular locus), it is also possible that there are general factors that influence the mutation rates of all the loci in H. influenzae. We have therefore investigated phase variation in H. influenzae using one of the loci that naturally contains repeats, the methyltransferase homologue, as a model locus. A lacZ reporter was inserted into the phase-variable methyltransferase homologue, which was engineered such that it contained different numbers of repeats. We demonstrate that mutation rates of a tetranucleotide repeat tract in the H. influenzae chromosome increase with the length of the tract. We also developed a model to investigate the implications of these switching rates as they pertain to the pathogenesis of H. influenzae infection. These results are important for our understanding of those phase variation events in H. influenzae that are driven by instabilities in microsatellites.


Mod is homologous to type III DNA methyltransferases

A search of the whole genome sequence of H. influenzae strain Rd identified tetranucleotide repeats in a gene, HI1056, with low homology to methyltransferases (Hood et al., 1996). A BLAST search (Altschul et al., 1990) of a non-redundant database was performed with the deduced amino acid sequence of HI1056, for which the closest homologues were type III DNA methyltransferases; expected values (E) for the EcoPI and StylTI enzymes were 2E-27 and 1E-19 respectively. Homology extended throughout the length of the reading frame (in a comparison with EcoPI, there were 106 similarities over 222 amino acids). This gene was therefore termed mod. A BLAST search with the deduced amino acid sequence of gene HI1055, whose initiation codon is 44 bp within the 3′ end of mod, identified homology with the restriction subunit of EcoPI (E-value 0.009; 49 similarities over 102 amino acids). This gene was therefore termed res. The mod ORF has two potential initiation codons (Karlin et al., 1996); the distal (5′) initiation codon is defined in the Rd sequence as the start codon of the predicted gene HI1058, an ORF with no homologues (Fig. 1). The two predicted translations of mod code for proteins of either 72 kDa (proximal ATG) or 86 kDa (distal ATG), whereas the res subunit is 60 kDa. The tetranucleotide (5′-AGTC) repeats are on the leading strand and are located 50 nt downstream of the closest ATG and 408 nt from the more distant ATG.

Figure 1. Schematic representation of the constructs inserted into the mod gene of H. influenzae. The upper diagram represents the wild-type locus. The ORFs are represented by open boxes. The repeats located in the mod ORF are represented by a grey box. The mod and res ORFs are partially overlapping. The mod ORF is preceded by an ORF coding for RNase HII. The HI numbers, from the Rd genome sequence, for the ORFs are indicated. The two potential ATG codons (distal and proximal), which are not in the same frame, are marked on the diagram. The direction of transcription is shown by a dotted line. Primers used for PCR are indicated by arrows showing their orientation. The lower diagram shows the insert of pGΔZK-wt. In this construct, fusion between the mod and lacZ ORFs occurs at the KpnI site. The kanR gene is in the opposite transcriptional direction relative to lacZ. This construct also contains a fragment of the res ORF to allow insertion by a double cross-over recombination event (indicated by the two crosses). The sequence of the construct from the wild-type mod gene to the KpnI site (underlined) in lacZ is shown at the bottom of the figure. This construct contained 40 repeats, which places lacZ in frame with the distal ATG.

When other type III DNA methyltransferases were used to screen the ORFs of the complete sequence of the H. influenzae Rd genome (Fleischmann et al., 1995) using the BLAST program, only mod was found to have significant homology. The same result was obtained for the res ORF. These results suggest that mod and res comprise the sole type III restriction/modification (R/M) system in H. influenzae strain Rd.

Prevalence of repeats in mod for a natural population of H. influenzae strains

In order to determine whether tetranucleotide repeats are normally associated with the mod/res locus, we surveyed a genetically distinct subset of H. influenzae strains, 23 non-typeable otitis media isolates chosen on the basis of ribotyping (G. Buldoc and R. Goldstein, personal communication). The res gene was shown to be present in all 23 strains, using PCR amplification with primers (RESN and RESC) that bind to the 5′ and 3′ ends of the gene (see Fig. 1). The size of the PCR amplification product was identical in all cases. PCR amplification of the mod gene (see Fig. 1 for primer binding sites), using primers H1RNH2, which binds 453 nt upstream of the distal ATG, and EXB1, which binds to the 3′ end of the gene, yielded products from 15 of the strains. These PCR products showed some heterogeneity in size. PCR amplification with RESC and a primer MODREV1R, which binds in the N-terminal portion of mod (110 nt downstream of the repeats), generated fragments of the predicted size from a further six strains. PCR products were also generated from these six strains with primers H1RNH2 and MODREV1, the reverse complement of MODREV1R. These results indicate that 21 out of the 23 strains contain an intact copy of mod. Finally, when PCR was performed with primers H1RNH2 and KPN1MOD (this primer binds to the 3′ side of the repeats), products were observed with the two strains that appeared to lack mod. This demonstrated that all 23 strains have the 5′ end of mod, the region that contains repeats in the Rd genome.

The repeats in the mod locus were analysed by sequencing relevant PCR products with primer MODP2. Sequence data were obtained from 15 strains using the H1RNH2/EXB1 PCR products, from four strains using the H1RNH2/MODREV1 PCR products and from four strains using the H1RNH2/KPN1MOD PCR products. The repeat tract was replaced in 10 strains by a 16 nt sequence that did not include direct repeats but maintained the mod ORF in frame with the distal ATG. Thirteen strains contained repeats of the tetranucleotide 5′-AGCC (the same sequence as observed in H. influenzae strains RM153 and RM7004; Hood et al., 1996). Repeat tracts ranged in length from 2 to 23 repeat units and seven tracts were > 14 units (30% of all the strains examined). Repeat tracts from seven strains were sequenced using the longer (i.e. H1RNH2/EXB1 or H1RNH2/MODREV1) PCR products that permitted an assessment of whether the ORF was in frame. In three strains with repeats (two with two repeats each and one with five repeats), mod was in frame with the proximal ATG. In another three strains with repeats, one with eight, one with 16 and one with 19, mod was in frame with the distal ATG, whereas in one strain containing 10 repeats mod was out of frame with both ATGs. The two strains with either eight or 10 repeats lacked four nucleotides immediately adjacent to the 3′ end of the repeats that were present in the other strains. In conclusion, the majority of H. influenzae disease-causing strains from a natural population contained the mod/res locus, and frequently tetranucleotide repeat tracts, often of a significant length (i.e. > 14 units), were present in this locus.

An H. influenzae serotype f strain (the ATCC standard strain) was also analysed because HinfIII, a type III restriction enzyme, was isolated from a serotype f strain (Kauc and Piekarowicz, 1978). PCR fragments were generated from this strain (using primers H1RNH2/MODREV1R and MODREV1/RESC), suggesting that it contains intact copies of mod and res.

High gene expression from the mod promoter

A plasmid, pGΔZK-wt, was constructed in which lacZ was inserted in frame close to the 3′ end of the repeat tract in mod and in which the majority of mod was deleted (see Fig. 1). The repeat tract in this clone consisted of 40 repeats of the tetranucleotide 5′-AGTC. The lacZ ORF lacked an initiation codon and could therefore only be translated from the mod initiation codon(s). Insertion of a kanR gene downstream of lacZ allowed selection of transformants. Transformants resulting from double cross-over events were, however, only observed when the kanR gene was orientated in the opposite transcriptional direction relative to lacZ. Constructs with this orientation were therefore used in subsequent experiments described in this paper. The other orientation resulted in a large number of transformants containing single cross-over events (data not shown).

H. influenzae strain Rd was transformed with the linearized plasmid pGΔZK-wt and transformants were isolated on kanamycin-containing plates. When these transformants were grown on a medium containing Xgal, they had a dark blue phenotype, suggesting that the mod promoter drives high-level gene expression, at least under these in vitro culture conditions. This allowed the efficient detection of lacZ reporter activity, which was an essential requisite for measuring the rate of phase variation at this locus.

The region containing the repeats was amplified using PCR from extracts of individual colonies and the products were sequenced. Colonies with the dark blue phenotype contained 40 repeats. In contrast, transformants with a white phenotype had 38, 39, 41 or 42 repeats. These results demonstrated that the distal ATG was the initiation codon and suggested that the Mod protein possesses an N-terminal extension not found in other type III DNA methyltransferases.
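The frameshift logic behind these colony phenotypes can be sketched in a few lines of Python. The sketch assumes only what the text states: each repeat unit is 4 nt, codons are 3 nt, and the reporter is in frame at 40 repeats in RdGΔZK-wt. The function name `lacz_in_frame` and the `on_reference` parameter are hypothetical names introduced for illustration.

```python
REPEAT_UNIT_NT = 4   # 5'-AGTC is a tetranucleotide
CODON_NT = 3
ON_REFERENCE = 40    # repeat number giving dark blue colonies in RdGΔZK-wt


def lacz_in_frame(n_repeats, on_reference=ON_REFERENCE):
    """True if the reporter is predicted to stay in frame with the initiation codon.

    Changing the tract by k units shifts the downstream frame by 4*k nt, so the
    frame is restored only when k is a multiple of 3.
    """
    shift_nt = REPEAT_UNIT_NT * (n_repeats - on_reference)
    return shift_nt % CODON_NT == 0


if __name__ == "__main__":
    for n in range(37, 44):
        print(n, "on (blue)" if lacz_in_frame(n) else "off (white)")
```

Under these assumptions, tracts of 38, 39, 41 or 42 repeats are predicted to be off, in line with the white colonies described above. The `on_reference` argument allows for constructs in which a different repeat number places lacZ in frame.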

Gene expression from the mod locus is phase variable

Phase variation of the mod locus in H. influenzae was investigated by measuring the frequency of variants in colonies of RdGΔZK-wt. Colonies of RdGΔZK-wt with a white phenotype grown in the presence of kanamycin contained a higher frequency of blue variants than those grown in its absence (data not shown); all subsequent phase variation experiments were therefore carried out in the absence of antibiotic. In contrast, no such bias in phase variation frequency was observed between RdGΔZK-wt colonies grown in the presence or absence of Xgal (data not shown). Frequencies were measured for RdGΔZK-wt colonies of both phenotypes as described in Experimental procedures. These frequencies were high, ranging from 0.0038 to 0.0431, and similar frequencies were observed for both directions of switching.

To investigate whether the frequency of phase variation was strain dependent, H. influenzae strains Eagan (RM153) and RM7004 were each transformed with plasmid pGΔZK-wt. These transformants showed phase variation frequencies (0.0041–0.0321 for Eagan; 0.0037–0.0678 for RM7004) similar to those observed with RdGΔZK-wt.

Phase variation rate increases with repeat tract length

There are wide variations in the number of tetranucleotide repeats in the microsatellites of H. influenzae. We investigated whether the number of repeats in mod might influence the phase variation rate. PCR products, consisting of the first part of the mod gene with different numbers of repeats, were generated and then fused to the lacZ ORF in pGreslacZ, resulting in six plasmids containing between 6 and 38 repeats (see Fig. 2, also described in Experimental procedures). These plasmids were used to transform H. influenzae strain Rd to give a set of chromosomal mod–lacZ fusions with 6, 10, 18, 24, 31 or 38 5′-AGTC tetranucleotide repeats. The phenotypes of these RdGΔZnR strains were examined by plating transformants in the presence of Xgal. RdGΔZ38R produced predominantly blue colonies (note that the pGΔZnR plasmids have fewer nucleotides between the last repeat and the KpnI site in the lacZ gene than pGΔZK-wt and thus the number of repeats that generates an in-frame ORF is different), whereas the others produced mainly white colonies. To measure the rates of switching in both directions (i.e. on to off and off to on), colonies of both blue and white phenotypes were isolated from each strain. Frequencies of phase variation were measured for multiple colonies of each phenotype and strain (Fig. 3). Substantial variation in data points, as seen in Fig. 3, is a characteristic feature of mutation rate measurements, as first described by Luria and Delbrück (1943). Overall, however, the frequency of variants increased as a function of the length of the repeat tract.

Figure 2. Schematic description of the construction of the lacZ fusions with different numbers of repeats. A first PCR amplification (with primers H1RNH2 and KPN1MOD) was performed on H. influenzae strain Rd genomic DNA to amplify the repeats and the region upstream from them. This PCR product was diluted and used in a second PCR, which was performed with H1RNH2 and a ‘SLIP’ primer, i.e. a primer containing a KpnI site (underlined) near its 5′ end and eight nucleotides complementary to two repeats at its 3′ end. This last feature theoretically allows the hybridization of the primer at 38 different positions in the repeat tract. The concentration of both primers (H1RNH2 and SLIP) was adjusted to avoid non-specific amplification in this second PCR. The PCR products generated by this second PCR were purified and cloned into pGΔZK-wt as described in Experimental procedures. This approach allowed the isolation of six constructs with different numbers of repeats (6, 10, 18, 24, 31 or 38).

Figure 3. Phase variation frequencies for tetranucleotide repeat tracts with different numbers of repeat units. H. influenzae strains were constructed that have chromosomal mod–lacZ fusions containing different numbers of 5′-AGTC repeats. Variants containing other repeat numbers were then isolated from these strains. ‘On’ (β-galactosidase-expressing) parental strains contained 5, 11, 17, 23, 32 or 38 repeat units, whereas ‘off’ (β-galactosidase non-expressing) strains contained 6, 10, 18, 24, 31 or 37 repeat units. All the strains were grown on media containing Xgal, and then isolated colonies were serially diluted and replated on media containing Xgal. Appropriate dilutions were used to estimate the frequency of variants present in the original colony, i.e. the number of colonies of the opposite phenotype to the starting phenotype divided by the total number of bacteria in the original colony. This frequency is plotted against the number of repeats in the parental strain from which each colony was derived. The on-to-off and off-to-on directions of reversion are represented by filled diamonds and open squares respectively. Each point represents analysis of a single colony.

Mutation rates can be calculated from switching frequencies using a variety of methods, from which we have chosen two. Table 1 shows the mutation rates for the RdGΔZ38R/37R/32R/31R/24R/23R/18R and 17R strains as calculated using two different methods. Mutation rates were not calculated for RdGΔZ11R/10R/6R and 5R because variant colonies were so rare (Fig. 3 includes results from multiple colonies of these strains in which no variants were detected), although we can conclude that mutation rates for these constructs were considerably lower than the mutation rates for RdGΔZ17R. The equation of Lea and Coulson (1949) gave the highest rates, which were ≈1.9 times higher than the rates derived using the equation of Drake (1991). The general trend towards higher mutation rates as the length of the repeat tract increased was similar for each of the methods. The only anomaly was that the off-to-on rates for RdGΔZ31R and RdGΔZ24R were inverted, a finding possibly the result of sampling error rather than inherent differences between the strains. The opposite result was observed with the on-to-off rates for RdGΔZ32R and RdGΔZ23R, variants isolated from RdGΔZ31R and RdGΔZ24R respectively. The frequencies of variants were used for a statistical analysis of differences between the switching rates of the RdGΔZnR strains with different tract lengths. A Mann–Whitney (non-parametric) rank sum test yielded P < 0.0001 for comparisons between the combined frequencies of the two strains with short repeat tracts (18R and 17R) and the combined frequencies of those with longer repeat tracts (23R and 24R; 31R and 32R; 37R and 38R) and P = 0.01 for the comparison between 23R, 24R and 37R, 38R. We conclude that mutation rates increase as a function of the length of the repeat tract.
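As a rough illustration of how a median variant frequency can be converted into a mutation rate, the Python sketch below solves the two relations in the forms in which they are commonly stated: Drake's formula, μ = f / ln(Nμ), and the Lea and Coulson method of the median, r/m − ln(m) = 1.24 with μ = m/N. Here f is the median variant frequency, N the number of cells per colony, r = fN the median number of variants and m the number of mutations per colony. The colony size N is not given in this section, so `n_cells` is a hypothetical input, and any corrections the authors applied are unknown; the sketch should therefore not be expected to reproduce the values in Table 1.

```python
import math


def _bisect(func, lo, hi, iterations=200):
    """Locate a sign change of func between lo and hi (both positive)."""
    f_lo = func(lo)
    for _ in range(iterations):
        mid = math.sqrt(lo * hi)  # geometric midpoint: roots may span many decades
        f_mid = func(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return math.sqrt(lo * hi)


def drake_rate(median_freq, n_cells):
    """Solve mu * ln(N * mu) = f for mu (formula attributed to Drake, 1991)."""
    return _bisect(lambda mu: mu * math.log(n_cells * mu) - median_freq,
                   1.0 / n_cells, 1.0)


def lea_coulson_rate(median_freq, n_cells):
    """Solve r/m - ln(m) = 1.24 for m, then return m / N (method of the median)."""
    r = median_freq * n_cells  # median number of variants per colony
    m = _bisect(lambda m: r / m - math.log(m) - 1.24,
                1e-12, 10.0 * max(r, 1.0))
    return m / n_cells


if __name__ == "__main__":
    f = 0.01       # illustrative value within the reported range 0.0038 to 0.0431
    n_cells = 1e8  # hypothetical colony size, not taken from the paper
    print("Drake:", drake_rate(f, n_cells))
    print("Lea and Coulson:", lea_coulson_rate(f, n_cells))
```

Both estimators are solved numerically because μ (or m) appears on both sides of the defining relation; bisection is used here only for simplicity.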

Table 1. Influence of 5′-AGTC tetranucleotide repeat tract length on phenotypic mutation rates in H. influenzae.
a. f is the median value (or the average of the two median values) for the frequency of variants per colony.
b. n is the number of colonies examined.
c. Calculated using the median frequency by the method of Drake (1991).
d. Calculated using the method of the median according to Lea and Coulson (1949).

The mutation rates described in Table 1 are phenotypic mutation rates. The genetic events were examined by sequencing repeat tracts from variant colonies. The results are shown in Table 2. The majority of events were frameshifts of one repeat unit (75 out of 90). Only three changes of greater than two repeat units were observed. These were relatively large deletions in repeat tracts of 37, 38 and 18 units. In order to estimate the frequency with which large deletion events occurred, we examined, using PCR amplification, the repeat tracts from a further 30 variants for each of the strains with tracts of 31, 32, 37 or 38 repeat units. No other large deletion events were detected, suggesting that such events are rare. The ratio of insertions to deletions was also examined. The off-to-on direction of mutation displayed a marked bias (Table 2); this occurs because only one of the one-repeat shifts produces an in frame ORF. With the on-to-off direction, there should not be any bias because both the + 1 and the − 1 repeat shifts will produce an out-of-frame ORF. The ratio of insertions to deletions was 1:1.7, 1:1.4, 1:1.8 and 1:3.75 for tracts with 38, 32, 23 and 17 repeat units respectively (Table 2). For the on-to-off direction, overall, there were twofold more deletions than insertions.

Table 2. Types of alterations for 5′-AGTC tetranucleotide repeat tracts of different lengths in H. influenzae.
a. Expressed as the number of repeat units deleted (−) or inserted (+).

The ratio of one to two repeat unit frameshifts was 6.25:1. A prediction based on a simple extrapolation of this ratio is that three-repeat-unit frameshifts (i.e. those not detected in this system) occurred 39-fold less often than one-repeat-unit frameshifts, indicating that the large majority of mutational events was detected in our system. We conclude that the phenotypic mutation rates were similar to the genetic mutation rates and we have therefore estimated mutation rates per repeat unit. To estimate these values, mutation rates (derived using the equation of Drake, 1991; see Table 1) were plotted against repeat tract length, and a linear regression line was fitted to the data (Fig. 4). The regression lines for the on-to-off and off-to-on directions of frameshifting converge such that mutation rates would be equal at an x value of 17, where x is the number of repeat units. Above a threshold of 17 repeats, the limit of the data examined in this work, the rate for off-to-on frameshifting increased at a rate of 4 × 10⁻⁶ mutations division⁻¹ repeat unit⁻¹, whereas for the on-to-off direction the corresponding result would be 7 × 10⁻⁶. It is noteworthy that there is a 1.75-fold difference between the mutation rates per repeat unit for the on-to-off and off-to-on directions of phenotypic switching. This fold difference is close to the theoretical value of 2 that is predicted for situations in which there are two ‘off’ ORFs to every ‘on’ ORF and if the majority of events are ± 1 or 2 repeat units (a condition that was satisfied, see Table 2).
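The extrapolation above can be retraced with the counts quoted in the text. In this Python sketch the number of two-unit events is inferred by subtraction, which assumes that all 90 sequenced events were one-unit, two-unit or larger changes. The 39-fold figure then follows if each additional repeat unit reduces the event frequency by the same factor, which is one reading of the 'simple extrapolation'.

```python
total_events = 90
one_unit_events = 75       # frameshifts of one repeat unit
larger_than_two = 3        # changes of more than two repeat units

# Inferred by subtraction (an assumption about how the 90 events partition).
two_unit_events = total_events - one_unit_events - larger_than_two

ratio_one_to_two = one_unit_events / two_unit_events

# Assume the same fold decrease applies from two-unit to three-unit events.
fold_one_to_three = ratio_one_to_two ** 2

print(two_unit_events, ratio_one_to_two, round(fold_one_to_three))
```

Under these assumptions the ratio is 6.25:1 and the predicted one-to-three fold difference rounds to 39, matching the figures in the text.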

Figure 4. Influence of tetranucleotide repeat tract length on mutation rates in H. influenzae. Mutation rates were derived from the frequencies of variants using the method described by Drake (1991) (see Table 1). These rates were calculated for both directions of switching and are plotted against the total number of repeats within the repeat tract. The on-to-off and off-to-on directions of switching are represented by filled diamonds and filled squares respectively. Linear regression lines were fitted to the data using Microsoft Excel; the equations for these lines are: off to on, y = (4 × 10⁻⁶)x − 2 × 10⁻⁵; and on to off, y = (7 × 10⁻⁶)x − 7 × 10⁻⁵.
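The two regression lines in the Figure 4 legend can be checked against the statements in the text with a short calculation. The Python sketch below uses only the rounded coefficients quoted in the legend, so the results are approximate; the function names are hypothetical.

```python
# Rounded regression coefficients from the Figure 4 legend (x = number of repeat units).
OFF_TO_ON = (4e-6, -2e-5)   # slope, intercept
ON_TO_OFF = (7e-6, -7e-5)


def rate(line, n_repeats):
    """Predicted mutation rate per division for a tract of n_repeats units."""
    slope, intercept = line
    return slope * n_repeats + intercept


# Tract length at which the two lines predict the same rate.
crossover = (OFF_TO_ON[1] - ON_TO_OFF[1]) / (ON_TO_OFF[0] - OFF_TO_ON[0])

# Ratio of the per-repeat-unit increases in rate for the two directions.
slope_ratio = ON_TO_OFF[0] / OFF_TO_ON[0]

print(f"lines cross near x = {crossover:.1f}; slope ratio = {slope_ratio:.2f}")
```

With these coefficients the lines cross between 16 and 17 repeat units and the slopes differ 1.75-fold, consistent with the values given in the text.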

A model of the implications of phase variation rate on genetic diversity and pathogenesis

The observation of large numbers of repeat units in the phase-variable genes of H. influenzae (High et al., 1996; Hood et al., 1996) is evidence that a high phase variation rate may be important in infections by this organism. In infections that lead to meningitis, H. influenzae undergoes translocation from the site of colonization (nasopharynx), where one phenotype is adaptive, to another anatomic site, where the adaptive phenotype is different (Moxon, 1992; Weiser and Pan, 1998). This change in phenotype is, in part, mediated by phase variation, so we developed a theoretical model to investigate the influence of mutation rate on the number of generations required to change from one phenotype to another. This switch in phenotype was assumed to require three independently phase-variable genes to switch from off to on. Effects on the range of genotypes generated in a population were also observed in this model. The basic parameters of the model were as follows: populations were initiated by a single progenitor (cell); mutations occurred during division, giving rise to one progeny of the parental genotype and one of the mutant genotype; the off and on states were designated as 1 and 2 respectively (binary switches) and so a change in genotype from 1,1,1 to 2,2,2 was required; mutations occurred in both directions at the same rates; and the rates were the same for all three loci. Three different mutation rates were examined: (i) 1.24 × 10^−4; (ii) 3.6 × 10^−5; (iii) 1 × 10^−6. The first two rates are those empirically derived for repeat tracts of 37 and 17 repeat units (Table 1) and the third rate is an arbitrary low value, higher than the mutation rate for point mutations and possibly representative of a repeat tract with six repeat units. Figure 5 shows the results of 1000 replicate experiments that were stopped at 20 generations; each replicate therefore corresponds to a bacterial population of about 1 × 10^6 cells (2^20). Two points of interest are observed. 
First, when the loci have 17 or 37 repeat units (Fig. 5B and C), all of the populations have four or more genotypes, whereas only 26% of the populations have four genotypes when the loci have six repeat units (Fig. 5A). This result demonstrates the strong influence of high mutation rates on genetic diversity. Second, only 5.2% of the populations with 17 repeat units at each locus (Fig. 5B) have more than four genotypes, whereas the figure for 37 repeat units is 45.2% (Fig. 5C). Thus, a twofold increase in the length of a repeat tract, or a threefold increase in mutation rate, produces a ninefold increase in the proportion of populations with more than four genotypes.
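The fold changes quoted in this paragraph can be checked directly from the values given in the text. The short Python sketch below is only an illustration; the helper name `fold_change` is hypothetical, and the inputs are the tract lengths, mutation rates and percentages reported for the 17- and 37-repeat simulations.

```python
def fold_change(new, old):
    """Return the ratio new/old, i.e. the fold increase from old to new."""
    return new / old


# Values quoted in the text for tracts of 17 and 37 repeat units.
tract_fold = fold_change(37, 17)            # repeat units per tract
rate_fold = fold_change(1.24e-4, 3.6e-5)    # mutation rate
diversity_fold = fold_change(45.2, 5.2)     # % of populations with > 4 genotypes

print(f"tract length: {tract_fold:.1f}-fold")
print(f"mutation rate: {rate_fold:.1f}-fold")
print(f"populations with more than four genotypes: {diversity_fold:.1f}-fold")
```

The three ratios are approximately 2.2, 3.4 and 8.7, which the text rounds to twofold, threefold and ninefold.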

Figure 5. A model of the effect of mutation rate on the genetic diversity of a bacterial population. A theoretical model was used to follow the generation of populations of cells from a single cell that has three independently phase-variable loci. Each locus can have two states, off and on. The starting situation was that all three loci were off. Mutations occurred in both directions at the same mutation rates. The population went through 20 generations, at which point the number of genotypes in the population was counted. The model was run 1000 times. The number of genotypes in a population is plotted against the number of populations. Note that a particular genotype could be represented by only a single cell and that the majority of cells are of the genotype that initiated the population. Three mutation rates were examined; in each case the mutation rates were the same for all three loci, and the number of 5′-AGTC repeats having a similar mutation rate is given in brackets: (A) 1 × 10^−6 (6); (B) 3.6 × 10^−5 (17); (C) 1.24 × 10^−4 (37).

In the model experiments described above, only one cell was observed to have switched at all three loci; this occurred when the loci each had 37 repeat units. In order to increase the frequency of this event, the number of generations had to be increased. However, this led to computational problems because of the sizes of the populations. This problem was simplified by removing, at generation 20, all cells with the starting genotype. We then observed that, with 37 repeat units at each locus, 370 out of 1000 replicates had produced cells that had switched at all three loci by generation 30. The value for loci with 17 repeat units was 21 out of 1000 replicates. Thus, a twofold increase in the length of a repeat tract, or a threefold increase in mutation rate, produces an 18-fold increase in simultaneous switching of three loci, an event that for the purpose of this model we assume to correlate with a change from the colonizing to the invasive phenotype.
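The rarity of triple switching within 20 generations can be made plausible with a back-of-the-envelope expectation. The Python sketch below is not the authors' program. It assumes only the rules stated for the model: at each division one daughter keeps the parental genotype, and the other may switch each locus independently, in either direction, with probability μ. The function name and the lineage-counting argument are assumptions introduced here for illustration.

```python
from math import comb


def expected_triple_switched(mu, generations, n_loci=3):
    """Expected number of cells with all loci 'on' after the given generations.

    Each cell in the final population is reached by a lineage in which it was
    the 'mutable' daughter j times (0 <= j <= generations); there are
    comb(generations, j) such lineages. After j symmetric two-state switching
    opportunities a locus that started 'off' is 'on' with probability
    (1 - (1 - 2*mu)**j) / 2, and the loci are independent.
    """
    total = 0.0
    for j in range(generations + 1):
        p_on = (1.0 - (1.0 - 2.0 * mu) ** j) / 2.0
        total += comb(generations, j) * p_on ** n_loci
    return total


if __name__ == "__main__":
    for mu in (1.24e-4, 3.6e-5, 1e-6):
        print(mu, expected_triple_switched(mu, 20))
```

Under these assumptions, μ = 1.24 × 10^−4 gives an expectation of roughly 2 × 10^−3 triple-switched cells per population at generation 20, i.e. of the order of a couple of cells across 1000 replicates, which is the same order as the single cell reported above. The calculation gives an expected cell count only; it does not model the removal of starting-genotype cells used to reach generation 30.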

Discussion

The bacterium H. influenzae contains a number of microsatellites (simple repeat sequences) that cause phase variation of surface-expressed molecules, including outer-membrane proteins, iron-binding proteins and LPS. In this report, we examine the putative methyltransferase gene mod, described by Hood et al. (1996) as containing repeats. This locus was utilized to examine some general features of phase variation in H. influenzae. The following points are made: (i) this locus encodes homologues of a type III DNA restriction/modification system; (ii) this locus is usually present in natural H. influenzae strains and frequently contains tetranucleotide repeats; (iii) phase-variable gene expression occurs from this locus; (iv) mutation rates of a tetranucleotide repeat tract in the chromosome of H. influenzae increase with the number of repeats within the tract; and (v) these tracts predominantly change by only a single repeat unit and there is a bias towards contraction of such tracts.

The gene mod encodes a protein with homology to the type III DNA methyltransferases of prophage P1 and Salmonella typhimurium. The adjacent downstream gene in the H. influenzae genome, res, has low homology to restriction enzymes of the type III family. This homology and organization is a strong indication that these genes indeed encode a type III restriction system. These genes are the only ones within the Rd genome to display such homology, and therefore it seems likely that HinfIII, the type III restriction enzyme purified from H. influenzae strain Rf by Kauc and Piekarowicz (1978), is encoded by mod and res. This is one line of evidence that these genes are functional. Further evidence of this was found by probing for mod in a selection of non-typeable strains. These strains were selected using a phylogenetic tree based on ribotyping, so as to have representative genetically distinct H. influenzae strains. An intact copy of res was found in all of the strains and mod was in the majority, indicating that this locus has a high level of conservation and is therefore likely to have an important biological function. Similarly, tetranucleotide repeats were found in 56% of the strains, suggesting that mod genes containing repeats are common in natural populations of H. influenzae and that there has been, and is currently, selection for strains with repeats in this locus.

The presence of repeats within a gene is a strong indicator that the gene is phase variable at a frequency determined by the hypermutability of its microsatellite. However, this has not been proved formally for many loci and has not been shown for genes encoding an R/M system. It is definitively shown in this report that expression of an active protein from the mod locus in the chromosome of H. influenzae strain Rd is switched on and off at a high frequency. It should be noted that transcription of this locus may be unaltered and that the Res protein may be produced. This protein will not be active, however, because functional type III restriction enzymes are complexes of the Mod and Res subunits. This establishes a secure basis for investigating the biological role of a phase-variable R/M system.

The effect of tract length on microsatellite instability has been examined for a variety of repeat unit sizes in a number of different organisms, but few studies have been performed in bacteria using chromosomal reporter constructs. Our experiments were performed in the bacterium H. influenzae, a pathogen that naturally contains simple repeats. This is in contrast to E. coli, the bacterium normally used for such experiments, which contains few microsatellites (Karlin et al., 1997). Our experiments have therefore direct relevance for the biology of H. influenzae, and possibly for related organisms that contain simple repeats.

Our demonstration that mutation rates of tetranucleotide repeat tracts in the H. influenzae chromosome increase as a function of the length of the tract does not differ qualitatively from conclusions made using other systems. There are, however, quantitative differences. To facilitate comparisons with other data, mutation rates were calculated using different methods (see Table 1). The caveats to such comparisons are numerous: different contexts of the repeats; use of plasmids compared with chromosomal constructs; use of different reporter systems; etc. Using plasmid constructs in yeast, Sia et al. (1997) found that (5′-CAGT)16 had a mutation rate of 4.9 × 10^−6 (calculated by the fluctuation test) and Wierdl et al. (1997), using the same system, found (5′-GT)16 varied at 5.9 × 10^−6, (5′-GT)25 at 2.1 × 10^−5 and (5′-GT)50 at 1.5 × 10^−4. Sia et al. (1997) also noted that tetranucleotide and dinucleotide tracts with similar numbers of repeat units had similar mutation rates, suggesting that dinucleotide and tetranucleotide tracts are comparable. The rate we observed for (5′-AGTC)17 (i.e. RdGΔZ17R) is 20-fold higher than the rates for similar sized tracts in the yeast system (and that is without taking into account the fact that plasmid constructs produce higher phenotypic mutation rates) and approaching the rate for the (5′-GT)50 tract. This suggests that mutation rates in yeast are substantially lower than those in H. influenzae. Using plasmid constructs in E. coli, Strauss et al. (1997) found (5′-CA)14 had a mutation rate of 4.9 × 10^−5 (calculated according to Drake, 1991) and Morel et al. (1998), using chromosomal constructs, found mutation frequencies (calculated by dividing the frequency of variants by the number of generations) for (5′-AC)18 and (5′-TG)21 of 5.9 × 10^−6 and 2.2 × 10^−5 respectively. 
These rates are similar to the rates for (5′-AGTC)17, which were ≈5.4 × 10^−5 (see Table 1) and 3.9 × 10^−5 (calculated as in Morel et al., 1998), and suggest that repeat tracts vary at similar rates in H. influenzae and E. coli. However, this similarity may be superficial as the patterns of genetic events (i.e. deletions and insertions) in these systems were not identical.

For a tetranucleotide repeat tract in the yeast system, Sia et al. (1997) observed a range of event sizes in wild type, but mainly +/−1 repeat shifts in mutants of the mismatch repair genes. In both cases, there was a 2:1 bias of deletions to insertions. Wierdl et al. (1997) found mainly +/−1 shifts for dinucleotide tracts of 7 or 16 repeat units, with a slight bias towards insertions, but many more larger events for the longer tracts, although still with a majority of small insertions. In the E. coli systems, Levinson and Gutman (1987b) observed for (5′-CA)20–22 mainly +/−1 shifts with a 3:1 bias in deletions; Strauss et al. (1997) observed only −1 deletions (however, these experiments only examined the off-to-on direction); and Morel et al. (1998) observed, for (5′-AC)51 in the wild-type strain, many large events, which were mostly deletions, but in the mutS, uvrD and dam mutants there were mainly +/−1 shifts, although still with a bias towards deletions. The mutational events observed in the 5′-AGTC tracts of the H. influenzae RdGΔZnR strains are similar to those described above in the ratio of deletions to insertions (2:1), but differ in the low frequency of large deletions (1.4%). This result suggests that there are no major differences between the processes that generate and/or correct small mutational events in microsatellites in H. influenzae relative to E. coli and, to a lesser extent, yeast. One implication of this result is that functional microsatellites have evolved in H. influenzae because of natural selection and, conversely, are not present in E. coli because of a lack of such selective pressure, possibly negated by the evolution of more complex regulatory systems. The bias towards contraction of the 5′-AGTC repeat tracts also implies that such tracts are maintained in a bacterial population by selection. 
The observation of two large deletions in the larger tracts may be an indication that as tract length increases so does the number of such deletions; this may be the process that limits the upper size of repeat tracts in H. influenzae. There is still, however, no indication of how such tracts may have evolved.

This work demonstrates that in a naturally phase-variable organism, H. influenzae, there is a relationship between the length of a 5′-AGTC repeat tract and mutation rate (or phase variation rate) and describes the first accurate estimations of the phase variation rate for a gene containing simple sequence repeats in this organism. We speculate that this relationship and these mutation rates are similar for the other tetranucleotide repeat tracts of this organism. These results suggest that phase-variable genes do not contain a random number of repeats, but rather that the number of repeats is modular for a number that produces an optimal phase variation rate for the bacterial population in its natural environments. We used our results and a simple theoretical model to explore some aspects of this theory. This model incorporated two other biologically relevant experimental results. First, some H. influenzae strains are highly infectious and single organisms can initiate fatal infections of 5-day-old rats (Moxon and Murphy, 1978). In the model, therefore, we simulated a bacterial population initiated by a single organism. Second, the ‘effective mean generation time’ (replication minus clearance) of H. influenzae in infections of 5-day-old rats is 50 min (Moxon, 1992); 20 generations is therefore equivalent to about 16.7 h (20 × 50 min) and is within the time required for an invasive infection to develop. Increases in tract length from 6 to 17 and from 17 to 37 repeat units had major effects on the genetic diversity of the populations produced by 20 generations (Fig. 5). This model examined only three phase-variable loci and differences in diversity are expected to increase greatly as a function of the number of loci, of which there are 12 in the Rd strain of H. influenzae. We therefore conclude that long repeat tracts in the phase-variable genes enable H. influenzae to rapidly generate genetically diverse populations. 
Such diversity is thought to be critical in the pathogenesis of infections because it allows rapid adaptation to changes in the environment (Moxon et al., 1994; Magnasco and Thaler, 1996; Moxon and Thaler, 1997). The model also showed that a change in phenotype caused by switching of three phase-variable genes could be achieved in under 30 generations (or 25 h) in a significant number of cases for repeat tracts of both 37 and 17 repeat units.

Experimental procedures

Bacterial strains and culture conditions

The H. influenzae type d strain RM118 (strain KW-20) is from the same source as strain Rd used for the sequencing project (Fleischmann et al., 1995). We also used the type b strains RM153, Eagan (Anderson et al., 1972) and RM7004 (van Alphen et al., 1983). The H. influenzae non-typeable strains were 432, 375, 1247, 1158, 1159, 1200, 1268, 1124, 981, 1233, 1207, 1209, 1008, 285, 723, 486, 176, 1232, 1231, 1181, 1292, 1180 and 162 (supplied by J. Eskola and the Finnish Otitis Media Study Group). H. influenzae strains were grown at 37°C in brain heart infusion (BHI) supplemented with either haemin (10 μg ml−1) and NAD (2 μg ml−1) in liquid medium or Levinthal supplement on solid medium.

E. coli strains DH5α [genotype: supE44 ΔlacU169(Φ80 lacZΔM15) hsdR17 recA1 endA1 gyrA96 thi-1 relA1] and MC1061 [genotype: hsdR mcrB araD139 Δ(araABC-leu)7679 ΔlacX74 galU galK rpsL thi] were used to propagate cloned plasmids and were grown at 37°C in Luria–Bertani (LB) broth supplemented with either ampicillin (100 μg ml−1) or kanamycin (50 μg ml−1).

Construction of phase-variable expression vectors

All enzymes were from Boehringer Mannheim, except T4 DNA ligase (USB). PCR was performed with oligonucleotides purchased from Genosys. Sequencing was performed on biotinylated PCR products using Dynabeads (Dynal) and the USB sequencing kit. Sequencing was also performed using the dRhodamine or Big-Dye (Perkin Elmer) sequencing kits and an ABI Prism 377 (Perkin Elmer). Southern blots were hybridized with DIG-labelled probes generated using the DIG High Prime kit and anti-DIG Fab fragments (Boehringer Mannheim).


The res ORF was amplified using PCR with primers RESN and RESC, and the PCR product was cloned into the SacI/SalI sites of pGEM11Zf(+) (Promega). A HindIII–BamHI restriction fragment containing the lacZ ORF was then excised from the pCH110 vector (Pharmacia) and inserted into the corresponding sites of this plasmid. The resulting plasmid was called pGreslacZ. This version of the lacZ ORF contains a KpnI site near its 5′ end that was used for inserting PCR products comprising mod fragments (see below and Fig. 1). The final step was to excise the Tn903 kanR gene from the pUC4K vector (Pharmacia) and insert this gene at the BamHI site downstream of lacZ. Transformants with the kanR gene in the opposite transcriptional direction relative to lacZ were identified and used for H. influenzae transformations.

Plasmid pGΔZK-wt contained a wild-type number of mod repeats (i.e. 40 5′-AGTC); the Δ indicates that this plasmid will produce a deletion of mod in the H. influenzae chromosome, the Z refers to lacZ and the K to the antibiotic cassette. This plasmid was constructed by amplifying the mod repeat region from genomic DNA of H. influenzae strain Rd with two primers (H1RNH2 and KPN1MOD) that span the repeats (this PCR generated a 1 kb fragment containing the promoter of mod and all its repeats), digesting the products with HindIII and KpnI and inserting them into the same sites in pGreslacZ, the vector described above. For the construction of cassettes with variable numbers of repeats, two PCRs were used (see Fig. 2). The first PCR was the same as just described. The second PCR was performed on 0.5 pg (7 × 10^4 molecules) of the first PCR product, in a 100 μl volume, with primer concentrations of 0.1 μM for the primer hybridizing upstream of mod (H1RNH2) and 1 μM for the SLIP primer, which hybridized in the repeat tract. These PCR products were extracted from agarose gels (QiaexII kit, Qiagen) after electrophoresis to remove contaminants and ligated into the pCR2.1 vector (Invitrogen). Several hundred transformants were pooled, and the plasmid DNA was prepared, restricted with HindIII and KpnI and inserted into pGreslacZ. Here again, transformants were pooled and the kanR gene was inserted as described above. Individual transformants were screened by amplifying the repeat tract with primers MODP2 and LACZB1 and sequencing with primer MODP2, using either Dynabeads and the USB kit or the Perkin-Elmer sequencing kits. Six plasmid constructs (termed pGΔZnR, where n is the number of repeats) were isolated; these contained repeat tracts of 6, 10, 18, 24, 31 and 38 repeats. Plasmid pGΔZ24R had a T → C mutation 20 nucleotides upstream of the first repeat.

Insertion of constructs into H. influenzae

Plasmids were linearized by digestion with ScaI and used to transform competent H. influenzae (Herriott et al., 1970). Transformants were selected on BHI plates containing Levinthal supplement and 10 μg ml−1 kanamycin. Transformants were checked by probing Southern blots of genomic DNA digests with a probe that hybridized to the region upstream of mod and using PCR and sequencing of the repeat region as described above. H. influenzae strains, containing repeat tracts of 5, 11, 17, 23, 32 and 37 repeats, were isolated from the H. influenzae strains constructed with the pGΔZnR plasmids by picking phenotypic variants.

Measurement of mutation rates

The H. influenzae transformants were streaked on BHI plates containing Levinthal supplement and Xgal (5-bromo-4-chloro-3-indolyl-β-D-galactopyranoside; 40 μg ml−1). Isolated colonies were resuspended in 500 μl of BHI, and 50 μl of 100-, 1000-, 10 000- and 100 000-fold dilutions were plated on BHI supplemented with Levinthal supplement and Xgal (40 μg ml−1). The number of variant colonies and the total number of colonies (reflecting the total number of bacteria in the initial colony) were counted. For each construct and phenotype, the total number of bacteria per colony was averaged for all the colonies that were examined. This value and the value for the median frequency of variants (or the average of the two median frequencies) were used to estimate the mutation rates. Mutation rates were derived using two methods: the equation of Drake, i.e. μ = 0.4343f/log(Nμ), where μ is the mutation rate, f is the median frequency and N is the average population size (Drake, 1991); and the fluctuation test, i.e. f = M[1.24 + ln(M)], from which the mutation rate can be derived using μ = M/N, where M is the average number of mutations per culture (Lea and Coulson, 1949). In each case, μ or M was derived by iteratively solving the relevant equation.
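Both estimators are implicit in the unknown, so they have to be solved numerically. The Python sketch below shows one way this could be done, using bisection rather than whatever iteration the authors used. It rewrites the equation of Drake, μ = 0.4343f/log(Nμ), as x ln x = fN with x = Nμ (0.4343 being log10 e). For the fluctuation test it assumes that the left-hand side of M[1.24 + ln(M)] is the median number of variants per culture, as in the median method of Lea and Coulson (1949). All function and parameter names are hypothetical.

```python
import math


def _bisect_increasing(func, lo, hi, target, iterations=200):
    """Solve func(x) == target on [lo, hi] for a monotonically increasing func."""
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if func(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def drake_rate(median_frequency, population_size):
    """Mutation rate from mu = 0.4343 f / log10(N mu), i.e. x ln x = f N."""
    target = median_frequency * population_size
    if target <= 0:
        raise ValueError("median frequency and population size must be positive")
    # x ln x is increasing for x >= 1 and exceeds the target at x = target + e.
    x = _bisect_increasing(lambda v: v * math.log(v), 1.0, target + math.e, target)
    return x / population_size


def lea_coulson_rate(median_variants, population_size):
    """Mutation rate from r = M (1.24 + ln M) and mu = M / N.

    Assumption: r is the median NUMBER of variants per culture.
    """
    if median_variants <= 0:
        raise ValueError("median number of variants must be positive")
    lower = math.exp(-1.24)  # M (1.24 + ln M) is zero here and increasing above
    m = _bisect_increasing(lambda v: v * (1.24 + math.log(v)),
                           lower, median_variants + 1.0, median_variants)
    return m / population_size
```

For example, `drake_rate(f, N)` and `lea_coulson_rate(f * N, N)` could be evaluated for each construct from its median variant frequency f and average colony size N; the two estimates need not agree exactly, since they rest on different assumptions.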

Analysis of tract lengths of variants

Phenotypic variant colonies from the above experiments were harvested into 100 ml of sterile distilled water and then boiled for 10 min. The mod repeat region was amplified using PCR with primers MODP2 and LACZB1 that span the repeat tract, and these products were sequenced by either manual or automated sequencing (as described above).

A theoretical model of phase variation

The program for the model was written in C and will be made available by the authors on request. The model followed the division of cells with three independently phase-variable loci. Mutations were simulated by the computer randomly selecting an integer between 1 and the reciprocal of the mutation rate, which was entered by the user. A mutation was deemed to have occurred when the computer selected a 1.
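The rules of the model can be illustrated with a short simulation. The Python sketch below is not the authors' C program, which is not reproduced here; it is a minimal reconstruction from the description in this paper. It assumes that every cell divides once per generation, that one daughter keeps the parental genotype, and that each locus of the other daughter switches (in either direction) when a random integer drawn between 1 and the reciprocal of the mutation rate equals 1. Genotypes are stored as bit masks, and all names are hypothetical.

```python
import random


def simulate_population(mutation_rate, generations, n_loci=3, rng=None):
    """Grow a population from one cell with all loci off; return all genotypes."""
    rng = rng or random.Random()
    draw_range = round(1.0 / mutation_rate)  # a draw of 1 counts as a mutation
    cells = [0]                              # bit i set means locus i is 'on'
    for _ in range(generations):
        daughters = []
        for genotype in cells:
            mutant = genotype
            for locus in range(n_loci):
                if rng.randint(1, draw_range) == 1:
                    mutant ^= 1 << locus     # switch this locus off<->on
            daughters.append(mutant)
        # The existing entries act as the daughters that keep the parental genotype.
        cells.extend(daughters)
    return cells


def count_genotypes(cells):
    """Number of distinct genotypes present in the population."""
    return len(set(cells))


if __name__ == "__main__":
    population = simulate_population(1.24e-4, 12, rng=random.Random(0))
    print(len(population), count_genotypes(population))
```

The example run uses 12 generations (4096 cells) to keep it quick. Tracking every cell for 20 generations means about 10^6 cells per replicate, which is slow in pure Python; a version that stores only the number of cells of each of the eight genotypes would scale better.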




The authors would like to thank sincerely Professor Juhani Eskola and all the members of the Finnish Otitis Media Study Group at the National Public Health Institute in Finland for the provision of NTHi strains from the inner ear fluid, obtained as part of the Finnish Otitis Media Cohort Study. We thank Anna Richardson and Ali Cody for help with sequencing. X.D.B. was supported by a FNRS Royal Society exchange fellowship and by an EMBO short-term fellowship. D.F. was supported by an NSF Sloan Fellowship in Molecular Evolution. N.J.S. was supported by a Wellcome Trust Research Fellowship in Medical Microbiology. This work was supported by programme grants from the Wellcome Trust and the Medical Research Council.