The mRNA interferases, MazF-mt3 and MazF-mt7 from Mycobacterium tuberculosis target unique pentad sequences in single-stranded RNA


*E-mail; Tel. (+1) 7322354116; Fax (+1) 7322354559.


mRNA interferases are sequence-specific endoribonucleases encoded by toxin-antitoxin (TA) systems in bacterial genomes. Previously, we demonstrated that Mycobacterium tuberculosis contains at least seven genes encoding MazF homologues (MazF-mt1 to -mt7) and determined cleavage specificities for MazF-mt1 and MazF-mt6. Here we have developed a new general method for determining recognition sequences longer than three bases for mRNA interferases, using phage MS2 RNA as a substrate together with CspA, an RNA chaperone that prevents the formation of secondary structures in the RNA substrate. Using this method, we determined that MazF-mt3 cleaves RNA at UU˘CCU or CU˘CCU and MazF-mt7 at U˘CGCU (˘ indicates the cleavage site). As pentad sequence recognition is more specific than that of previously characterized mRNA interferases, bioinformatics analysis was carried out to identify M. tuberculosis mRNAs that may be resistant to MazF-mt3 and MazF-mt7 cleavage. The pentad sequence was found to be significantly underrepresented in several genes, including members of the PE and PPE families, large families of proteins that play a role in tuberculosis immunity and pathogenesis. These data suggest that MazF-mt3 and MazF-mt7 or other mRNA interferases that target longer RNA sequences may alter protein expression through differential mRNA degradation, a regulatory mechanism that may allow adaptation to environmental conditions, including those encountered by pathogens such as M. tuberculosis during infection.


mRNA interferases are sequence-specific endoribonucleases encoded by toxin-antitoxin systems that are present in a wide range of bacterial genomes (Zhang et al., 2003b). These endoribonucleases usually have specific cleavage sites within single-stranded RNA; for example, Escherichia coli MazF specifically cleaves at ACA sequences (Zhang et al., 2005b), ChpBK at ACY sequences (where Y is U, A, or G) (Zhang et al., 2005a), PemK from plasmid R100 at UAH sequences (where H is C, A, or U) (Zhang et al., 2004), and MazF-mt1 and MazF-mt6 from Mycobacterium tuberculosis at UAC and U-rich regions respectively (Zhu et al., 2006). Recently, a MazF homologue from Staphylococcus aureus was reported to cleave at VUUV′ (where V and V′ are A, C, or G and may or may not be identical) (Fu et al., 2007).

The overproduction of MazF in E. coli causes complete cell growth arrest, resulting in a quasi-dormant state in which the cells still retain full capacity for protein synthesis (Suzuki et al., 2005). The genes for MazF homologues are usually coexpressed with their cognate antitoxin genes, encoding MazE, and appear to function primarily to regulate cell growth under stress conditions. Interestingly, a subset of genes encoding MazF homologues are not coexpressed with their cognate antitoxins; for example, Myxococcus xanthus, a developmental soil-dwelling bacterium, contains a solitary mazF gene, whose developmental expression is essential for obligatory lysis of cells during differentiation to form fruiting bodies (Nariya et al., 2008). Some bacteria contain more than one MazF homologue. As we recently demonstrated, there are 60 TA systems encoded by the M. tuberculosis genome, of which at least seven are MazF homologues, suggesting that these MazF homologues may have individual roles in the physiology of this pathogen (Zhu et al., 2006). Interestingly, in contrast to M. tuberculosis, the non-pathogenic mycobacterium M. smegmatis has only two TA systems (mazEF and phd/doc) in its genome (Pandey and Gerdes, 2005).

Our previous attempts to determine the cleavage specificities of some of the MazF homologues of M. tuberculosis failed. One possible reason for this could be a low frequency of toxin-specific target sequences, which could result from differences in GC content and/or more stringent (longer) sequence requirements for these toxins. For this reason, in the present paper, we developed a general method to determine the cleavage-site specificity for mRNA interferases using bacteriophage MS2 RNA. This commercially available RNA is 3569 bases in length, and sufficiently complex to allow determination of the cleavage-site specificity for sites that are longer than three bases. However, MS2 RNA contains extensive secondary structures [more than 70%; (Chon et al., 2002)] and has a relatively high melting temperature (58°C). Therefore, this RNA is highly resistant to endoribonucleases or mRNA interferases, which cleave only single-stranded regions of RNA. In order to make the entire sequence of the RNA susceptible to mRNA interferases, we sought to facilitate melting of secondary structures in the RNA. For this purpose, we used CspA, the major cold-shock protein of E. coli, which functions as an RNA chaperone (Jiang et al., 1997).

Using this method, we identified the cleavage sites for MazF-mt3 and MazF-mt7 to be CU˘CCU/UU˘CCU and U˘CGCU (˘ indicates the cleavage site) respectively. This approach enabled us to detect almost all the cleavage sites for MazF-mt3 as well as MazF-mt7, demonstrating the sensitivity and general applicability of the method for identifying cleavage sites targeted by sequence-specific endoribonucleases.


Development of a general method to determine the cleavage specificity of mRNA interferases using CspA, an RNA chaperone

MazF-mt3 is an M. tuberculosis homologue of E. coli MazF. Previously we demonstrated that induction of MazF-mt3 causes cell growth inhibition in E. coli (Zhu et al., 2006). We attributed this toxic effect of MazF-mt3 to its mRNA interferase activity; however, we were not able to determine its cleavage specificity, possibly because it recognizes highly specific cleavage sequences. To determine the cleavage specificity of mRNA interferases that recognize a sequence longer than three bases, it is essential to use an RNA substrate that is long enough to cover all possible target sequences; for example, an RNA substrate should be longer than 1024 bases (= 4^5) to contain all possible five-base cleavage sites, assuming that the RNA contains an equal number of each base. For this reason, we chose bacteriophage MS2 RNA, consisting of 3569 bases and having a fairly even base content (26% G, 23% A, 26% C and 25% U), as a substrate for the determination of the cleavage specificity of MazF-mt3. A major problem in using MS2 RNA (or any long RNA) as a substrate, however, is that although it is a single-stranded RNA, it forms extensive stable secondary structures [see Fig. 1A; (Chon et al., 2002)]. Therefore, in order to use this RNA as a substrate for endoribonucleases, its secondary structures have to be unfolded. Because CspA, the major cold-shock protein of E. coli, is known to function as an RNA chaperone and to prevent the formation of secondary structures in RNAs (Jiang et al., 1997), we first attempted to develop a general method for determination of the cleavage specificity of endoribonucleases using MS2 RNA in the presence of CspA.
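The length argument above can be illustrated with a short calculation. The sketch below is not part of the original analysis; it assumes that bases occur independently at the quoted MS2 frequencies, and the function name `expected_count` is introduced here for illustration only.

```python
# Base composition and length of MS2 RNA as quoted in the text.
composition = {"G": 0.26, "A": 0.23, "C": 0.26, "U": 0.25}
rna_length = 3569


def expected_count(motif: str, length: int, freqs: dict) -> float:
    """Expected occurrences of `motif` in a random RNA of `length` bases.

    Assumption (not from the paper): bases are independent and drawn
    with the frequencies given in `freqs`.
    """
    p = 1.0
    for base in motif:
        p *= freqs[base]
    windows = length - len(motif) + 1  # number of possible start positions
    return p * windows


# Number of distinct five-base sequences over a four-letter alphabet.
n_pentamers = 4 ** 5
print(n_pentamers)

# Expected number of CUCCU sites in a random RNA with MS2-like composition.
print(round(expected_count("CUCCU", rna_length, composition), 2))
```

Under these assumptions a given pentad is expected only a few times in an RNA of this length, which is why a substrate of several kilobases is needed to observe a five-base specificity reliably.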

Figure 1.

Secondary structure of MS2 RNA and enhancement of MazF-mt3 endoribonuclease activity by CspA.
A. Predicted secondary structure of MS2 ssRNA by MFOLD. The minimum folding energy at 37°C is −1460.69 kcal mol−1. The red arrows indicate cleavage sites of MazF-mt3 identified by primer extension analysis and the blue dotted arrow indicates the predicted cleavage site of MazF-mt3 that was not detected by primer extension analysis.
B. Enhancement of endoribonuclease activity by RNA chaperone CspA. The full-length MS2 mRNAs were partially digested at 37°C with (lanes 3–6) or without (lanes 1 and 2) purified N-terminal His-tagged MazF-mt3, with (lanes 2 and 5, 32 μg; lane 4, 16 μg; lane 6, 64 μg) or without (lanes 1 and 3) purified CspA protein. The digestion reaction mixture (10 μl) consisted of 1.6 μg of MS2 RNA substrate, 1.6 μg of MazF-mt3(His)6, 16, 32 or 64 μg CspA and 0.5 μl of RNase inhibitor (Roche) in 10 mM Tris-HCl (pH 7.8). The reaction products were run on a 1.2% (1× TBE) agarose gel.

We purified (His)6MazF-mt3 protein expressed from pET-28a-MazF-mt3 as previously described for MazF-mt1 and -mt6, using Ni-nitrilotriacetic acid resin (Zhu et al., 2006). When purified MazF-mt3 was incubated with MS2 RNA in the absence of CspA, partial cleavage of the RNA was observed (Fig. 1B, lane 3). As increasing amounts of CspA protein were added to the reaction mixture, the cleavage of the MS2 RNA substrate was enhanced (lanes 4–6). At the highest concentration of CspA no full-sized MS2 RNA was detected (lane 6). This concentration (0.86 mM) is 32-fold higher than the Kd value of CspA for RNA binding, indicating that the MS2 RNA was almost completely saturated with CspA, which binds every six bases of RNA (Jiang et al., 1997). These results also demonstrate that MazF-mt3 is an mRNA interferase capable of cleaving single-stranded RNAs that are generated when CspA unfolds MS2 secondary structure.
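The relation between the CspA masses in the Fig. 1B legend and the molar concentrations quoted in the text can be checked with a rough conversion. This is a sketch only: the molecular mass of about 7.4 kDa for CspA is an assumption brought in here and is not stated in this paper, and `cspa_millimolar` is a hypothetical helper name.

```python
CSPA_MW_DA = 7400.0        # assumed molecular mass of CspA in g/mol (not from the text)
REACTION_VOLUME_UL = 10.0  # reaction volume given in the Fig. 1B legend


def cspa_millimolar(mass_ug: float,
                    volume_ul: float = REACTION_VOLUME_UL,
                    mw_da: float = CSPA_MW_DA) -> float:
    """Convert a CspA mass in a reaction into a millimolar concentration."""
    grams_per_litre = mass_ug / volume_ul  # 1 ug/ul is the same as 1 g/L
    return grams_per_litre / mw_da * 1000.0


# The three CspA amounts used in Fig. 1B.
for mass in (16, 32, 64):
    print(mass, "ug ->", round(cspa_millimolar(mass), 2), "mM")
```

With the assumed mass, 64 μg in 10 μl comes out close to the 0.86 mM quoted above, and 32 μg gives about half that value.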

MazF-mt3 specifically cleaves RNA at CUCCU and UUCCU

Next, we carried out primer extension experiments to determine the cleavage sites of MazF-mt3 in the MS2 RNA using the reaction mixture in the presence of 0.43 mM CspA. In order to cover the entire MS2 RNA sequence, a total of 22 primers were synthesized for the primer extension experiments. As shown in Fig. 2A–H, in most cases the addition of CspA significantly enhanced RNA cleavage. Notably, the RNA cleavage was detectable only in the presence of CspA for the cleavage site shown in Fig. 2F. The cleavage of the RNA at base 1028 (Fig. 2B) appears to be reduced in the presence of CspA, which is due to enhanced cleavage of a site at base 1078 immediately downstream of the cleavage site at base 1028, located between the primer binding site and the upstream cleavage site. This highly enhanced cleavage site at base 1078 in the presence of CspA is shown in Fig. 2C. Through these experiments, eight cleavage sites were identified, as listed in Table 1. The consensus sequence from these cleavage sites is CUCCU, where MazF-mt3 cleaves between the U residue in the second position and the C residue in the third position. Although we identified eight MazF-mt3 cleavage sites, there are nine CUCCU sequences in the MS2 RNA according to the published MS2 sequence (RefSeq: NC_001417 from NCBI website). We therefore re-sequenced the region containing the ninth CUCCU sequence and found that there is no CUCCU sequence in the region. The MS2 RNA used in these experiments (Roche) contained CCUCU (base 2158 to base 2162) instead of CUCCU. Therefore, all the CUCCU sequences in MS2 RNA were cleaved by MazF-mt3.

Figure 2.

MazF-mt3 cleaves specifically at CUCCU sites.
A–H. In vitro cleavage of the MS2 RNA with (His)6MazF-mt3. Lane 1 represents a control reaction in which no proteins were added; lane 2, a control reaction in which only CspA protein was added; lane 3, MS2 RNA incubated only with (His)6MazF-mt3; lane 4, MS2 RNA incubated with (His)6MazF-mt3 and CspA protein. Cleavage sites are indicated by red arrows on the RNA sequence and were determined using the RNA ladder shown on the right. In (H) the RNA ladder was compressed between the C (base 3390) and G (base 3394) residues, but we were able to elucidate the sequence through comparison of the upstream and downstream regions with the standard sequence.

Table 1.  Cleavage sites of MazF-mt3 in MS2 RNA.

We also detected four other cleavage sites that differ from those found in Fig. 2, as shown in Fig. 3A–D. The consensus sequence from these primer extension experiments is UUCCU, where MazF-mt3 cleaves between the U residue in the second position and the C residue in the third position. There are five UUCCU sequences in the MS2 RNA, all of which were confirmed by RNA sequencing. At present it is not clear why the fifth sequence (from base 1804 to base 1808; see Table 1) is not cleaved by MazF-mt3. Under the conditions used, the region containing this sequence might not be fully unfolded even in the presence of CspA. All the cleavage sites for MazF-mt3 are shown by red arrows in Fig. 1A.
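Predicting where MazF-mt3 would be expected to cut a given RNA amounts to scanning for the two pentads and reporting the position after the second base. The following is a minimal sketch of such a scan; the function name `find_cleavage_sites` and the toy sequence are hypothetical and are not taken from the MS2 genome or from Table 1.

```python
# Motif -> number of bases on the 5' side of the cut.
# MazF-mt3 cleaves between the second and third bases of both pentads.
MAZF_MT3_MOTIFS = {"CUCCU": 2, "UUCCU": 2}


def find_cleavage_sites(rna: str, motifs: dict = MAZF_MT3_MOTIFS) -> list:
    """Return (motif, start, cut_after) tuples using 1-based coordinates.

    `start` is the first base of the motif; `cut_after` is the last base
    on the 5' side of the predicted cleavage.
    """
    rna = rna.upper().replace("T", "U")  # accept DNA-style input as well
    sites = []
    for i in range(len(rna)):
        for motif, offset in motifs.items():
            if rna.startswith(motif, i):
                sites.append((motif, i + 1, i + offset))
    return sites


toy_rna = "GGCUCCUAAUUCCUGG"  # hypothetical sequence for illustration
print(find_cleavage_sites(toy_rna))
# [('CUCCU', 3, 4), ('UUCCU', 10, 11)]
```

A scan of this kind only lists candidate sites. As the uncleaved fifth UUCCU shows, whether a candidate is actually cut can also depend on local RNA structure.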

Figure 3.

MazF-mt3 cleaves specifically at UUCCU sites.
A–D. In vitro cleavage of the MS2 RNA with (His)6MazF-mt3. Lane 1, control reaction in which no proteins were added; lane 2, control reaction in which only CspA protein was added; lane 3, MS2 RNA incubated only with (His)6MazF-mt3; lane 4, MS2 RNA incubated with (His)6MazF-mt3 and CspA protein. Cleavage sites are indicated by red arrows (strong cleavage sites) or pink arrows (weak cleavage sites) on the RNA sequence and were determined using the RNA ladder shown on the right.

The target pentad sequence for MazF-mt3 is underrepresented in PE and PPE family proteins in M. tuberculosis

It is tempting to speculate that the pentad sequences identified as the cleavage sites for MazF-mt3 (CUCCU and UUCCU) might allow more stringent control of specific protein synthesis under distinct conditions, either by protecting mRNAs from being cleaved or by making them more sensitive to MazF-mt3 endoribonuclease activity. Therefore, we examined the entire M. tuberculosis H37Rv genome to search for open reading frames (ORFs) that contain the pentad sequences at a much lower or, conversely, a much higher frequency than expected, where the expected frequency of each pentad sequence in a gene was calculated from the nucleotide composition of that ORF.

Interestingly, we found that certain genes contain the MazF-mt3 pentad sequences at a much lower frequency than expected. Using a cut-off of P < 0.01, we identified 10 such genes in the genome. Four of these 10 genes were identified as members of the PPE gene family, which has been proposed to play a role in the immunopathogenicity of M. tuberculosis (Brennan and Delogu, 2002) (Table 2). Three of these four genes, PPE54, PPE55 and PPE56, are in a single locus in the chromosome of M. tuberculosis, though they do not appear to be part of a single operon. All four of these PPE genes encode members of the major polymorphic tandem repeat subfamily of PPE proteins (PPE_MPTR) (Gey van Pittius et al., 2006). Substituting n = 10 and m = 4 in Eq. 2, this overrepresentation of the PPE family has a P-value less than 0.0006. This result suggests that the mRNAs from these genes may be more resistant to MazF-mt3 cleavage. Extending this analysis to genes with a lower than expected number of cleavage sites with P < 0.02, we found a total of 20 genes (including the 10 with a probability lower than 1%). Seven out of these 20 genes (35%) were found to belong to the PE or PPE families (see Data S1). Substituting n = 20 and m = 7 in Eq. 2, this overrepresentation of the PE and PPE proteins has a P-value less than 0.00002. This analysis indicates that the pentad sequences recognized by MazF-mt3 are underrepresented in specific members of the PE and PPE families, suggesting that the mRNAs for these genes are relatively resistant to MazF-mt3 compared with most mRNAs encoded in the M. tuberculosis genome.

Table 2.  The top 10 MazF-mt3-resistant genes in the M. tuberculosis genome.
CDS position | Length | Expected motif counts | Actual motif counts | Probability | Locus | Gene

MazF-mt7 is also a sequence-specific mRNA interferase

To validate the general applicability of the method, we used the MS2 RNA-CspA system to identify the specific cleavage sites of MazF-mt7. Using purified N-terminal His-tagged MazF-mt7, primer extension experiments were carried out; the results are shown in Fig. 4A–R and summarized in Table 3. Most of the cleavage sites contain the sequence UCGCU (Fig. 4A–K), where MazF-mt7 cleaves between the U residue in the first position and the C residue in the second position. Some cleavage sites have a one-base mismatch from the consensus UCGCU (Fig. 4C, D, H, L–P), and a few cleavage sites have a two-base mismatch (Fig. 4G, N, Q and R). All these cleavage sites, however, share the central G residue and most of them also have a C residue following the central G residue. MazF-mt7 thus is an mRNA interferase that recognizes a five-base U˘CGCU sequence; however, it appears to be less stringent than MazF-mt3.
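The grouping of MazF-mt7 sites by their distance from the consensus can be expressed as a simple mismatch count. The sketch below is illustrative only; the example sites are hypothetical and are not entries from Table 3, and the helper names are introduced here.

```python
CONSENSUS = "UCGCU"  # MazF-mt7 consensus; cleavage occurs after the first U


def mismatches(site: str, consensus: str = CONSENSUS) -> int:
    """Count positions at which `site` differs from the consensus."""
    if len(site) != len(consensus):
        raise ValueError("site and consensus must have the same length")
    return sum(a != b for a, b in zip(site.upper(), consensus))


def has_central_g(site: str) -> bool:
    """True if the middle base of the site is G."""
    return site.upper()[len(site) // 2] == "G"


# Hypothetical sites: an exact match, a one-base and a two-base mismatch.
for site in ("UCGCU", "UCGCA", "ACGCA"):
    print(site, mismatches(site), has_central_g(site))
```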

Figure 4.

Cleavage of MS2 RNA by MazF-mt7.
A–R. In vitro cleavage of the MS2 RNA with (His)6MazF-mt7. Lane 1, control reaction in which no enzymes were added; lane 2, control reaction in which only CspA protein was added; lane 3, MS2 RNA incubated only with (His)6MazF-mt7; lane 4, MS2 RNA incubated with (His)6MazF-mt7 and CspA protein. Cleavage sites are indicated by red arrows (strong cleavage sites) or pink arrows (weak cleavage sites) on the RNA sequence and were determined using the RNA ladder shown on the right.

Table 3.  Cleavage sites of MazF-mt7 in MS2 RNA.

The target pentad sequence for MazF-mt7 is underrepresented in PE and PPE genes in M. tuberculosis

A bioinformatics analysis similar to that described above was carried out to search for M. tuberculosis mRNAs that are resistant or hypersusceptible to cleavage by MazF-mt7 (see Data S2). Among the top 10 MazF-mt7-resistant genes (Table 4), four belong to the PE or PPE family: three are PPE_MPTR genes and one is a PE_PGRS gene. Most significantly, both PPE55 and PPE56 are resistant not only to MazF-mt7 but also to MazF-mt3. PPE55 is known to be highly immunogenic, and is expressed during incipient, subclinical M. tuberculosis infection. PPE56 is highly homologous (67%) to PPE55 (Singh et al., 2005). These data suggest that the expression of different PE and PPE family proteins may be regulated at certain stages of infection by mRNA interferases and raise the possibility that regulation of protein expression by TA systems may play a role in M. tuberculosis pathogenesis.

Table 4.  The top 10 MazF-mt7-resistant genes in the M. tuberculosis genome.
CDS position | Length | Expected motif counts | Actual motif counts | Probability | Locus | Gene


In the present report, we first developed a general method for the determination of cleavage specificities of mRNA interferases using the 3.5 kb MS2 RNA and CspA, an RNA chaperone. In our earlier work with MazF (Zhang et al., 2003b), ChpBK (Zhang et al., 2005a), PemK (Zhang et al., 2004) and MazF-mt1 and MazF-mt6 (Zhu et al., 2006), in vivo primer extension was sufficient to determine their cleavage site specificities, because all these mRNA interferases recognize only three-base sequences in mRNAs. On the other hand, we were unable to determine the cleavage site specificity for MazF-mt3 by in vivo primer extension experiments, because very few cleavage sites were found. We reasoned that this might be due to a longer recognition sequence for MazF-mt3 cleavage. Using the 3.5 kb MS2 RNA, we indeed identified several MazF-mt3 cleavage sites. The further identification of the sequences recognized by MazF-mt7 supports the general applicability of the method for the determination of the cleavage site specificities for mRNA interferases that recognize four- or five-base sequences. It should be emphasized that the use of CspA is very important to remove the secondary structures in the RNA and so allow the mRNA interferase access to the entire 3.5 kb RNA. Using this technology, we were able to identify two specific five-base sequences, CU˘CCU and UU˘CCU, for MazF-mt3, and also U˘CGCU for MazF-mt7. As previously reported (Zhu et al., 2006), MazF-mt3 has a putative 11-residue S1–S2 loop similar to that of E. coli MazF, whereas MazF-mt7 has a much shorter S1–S2 loop (only four residues). As the S1–S2 loop has been implicated in stabilizing the interaction of MazF with its mRNA substrates, the S1–S2 loop structures of MazF-mt3 and -mt7 are consistent with our observation. Although both MazF-mt3 and MazF-mt7 target pentad sequences, MazF-mt3 has a higher specificity for sequence recognition than MazF-mt7.

The discovery of mRNA interferases with higher cleavage specificities raises intriguing questions about their function. These enzymes have been proposed to be activated in response to stresses, causing a global shutdown of translation by degradation of mRNAs (Inouye, 2006). This global response may lead to a transient state of metabolic ‘dormancy’ or eventual cell death as an adaptive response at the cell or population level. In addition, mRNA interferases with greater specificity such as MazF-mt3 may regulate specific mRNAs in the cell either negatively (by cleaving them) or positively (by not being able to cleave them). This differential susceptibility of individual mRNAs to cleavage by more specific mRNA interferases may allow a focused response to specific stresses or environmental conditions, rather than the generalized response attributed to the TA systems characterized to date, which encode mRNA interferases with three-base target sequences.

The pathogen M. tuberculosis contains a remarkably large number of TA systems (as many as 60), including at least seven MazF homologues (Pandey and Gerdes, 2005; Golby et al., 2007), in contrast to the non-pathogenic mycobacterium M. smegmatis. Despite a much larger genome (7 vs. 4.4 megabases), M. smegmatis has only two TA systems, including one MazF homologue. These facts suggest an intriguing hypothesis regarding the role of TA systems in M. tuberculosis: individual TA systems may be differentially activated in response to different environmental conditions in the host during the course of infection. Furthermore, individual mRNA interferases may uniquely target specific groups of mRNAs as part of an adaptive response, and thus contribute to the pathogenesis of this organism.

Though this hypothesis remains to be tested, our finding that a subset of the PE and PPE family genes encode mRNAs that may be relatively protected from degradation by MazF-mt3 and MazF-mt7 is consistent with this prediction. Though the function of most genes in this family is obscure at present, a number of these proteins have been shown to localize to the mycobacterial cell wall/cell surface, and several have been shown to provoke an immune response in vivo or play a role in host–pathogen interactions (Brennan and Delogu, 2002; Denny and Smith, 2004; Basu et al., 2007; Mishra et al., 2008). In addition, it has been shown that members of these families are differentially expressed in different tissues or at distinct times during growth in vitro (Delogu et al., 2006; Dheenadhayalan et al., 2006). Thus, to the extent that MazF-mt3 and MazF-mt7 can affect the production of specific PE or PPE family proteins in vivo, these mRNA interferases could play a role in the pathogenesis of tuberculosis.

It is very interesting that the target pentad sequences of MazF-mt3 and MazF-mt7 are underrepresented in the PPE_MPTR subfamily. Members of this group are present in multiple copies only in M. tuberculosis and its close relatives, and are totally absent in the fast-growing mycobacteria, including M. smegmatis. This finding suggests that these genes have been positively selected during the evolution of these slow-growing pathogens, raising an interesting possibility that the target specificity of these mRNA interferases might have played a role in the evolutionary process to expand the PE/PPE family in the slow-growing pathogenic mycobacteria.

As many PE family proteins (notably the PE_PGRS group) are rich in glycine and alanine, the low frequency of MazF-mt3 and MazF-mt7 sequences could be secondary to codon usage driven by processes independent of MazF-mt3 and MazF-mt7 targeting. The fact that only a small fraction of the more than 168 PE and PPE family genes have significantly fewer cleavage sites than predicted, however, suggests that MazF-mt3 and MazF-mt7 interferase activities may degrade these mRNAs to a lesser extent than those encoding other PE or PPE family proteins.

The present work demonstrates a novel and generalizable approach to the identification of mRNA interferase target sequences, including those with sequence requirements longer than the three-base targets previously recognized. The identification of the pentad target sequences required by MazF-mt3 and MazF-mt7 suggests the possibility of differential mRNA degradation as a novel regulatory mechanism for adaptation to environmental conditions. This work thus opens up a new avenue for determining the sequence specificity of the full complement of mRNA interferases in M. tuberculosis and other organisms, and for investigating whether sequence-specific mRNA degradation plays a regulatory role in microbial adaptation, including adaptation that contributes to virulence in pathogenic microorganisms such as M. tuberculosis.

Experimental procedures

Strains and plasmids

E. coli strain BL21(DE3) was used for recombinant protein expression. Plasmids pET-28a-MazF-mt3 and pET-28a-MazF-mt7 were constructed from pET-28a (Novagen) to express (His)6MazF-mt3 and (His)6MazF-mt7 respectively.

Purification of (His)6 tagged MazF-mt3 and MazF-mt7 in E. coli

MazF-mt3 and MazF-mt7, (His)6-tagged at the N-terminal end, were purified from strain BL21(DE3) carrying pET-28a-MazF-mt3 or pET-28a-MazF-mt7 using Ni-NTA resin (Qiagen) as described previously (Zhang et al., 2003a).

Purification of CspA protein from E. coli

CspA was purified as described previously (Chatterjee et al., 1993).

Primer extension analysis in vitro

For primer extension analysis of mRNA cleavage sites in vitro, the full-length MS2 mRNAs were partially digested with or without purified toxin protein (MazF-mt3 or MazF-mt7), and with or without purified CspA protein at 37°C for 15 min. The digestion reaction mixture (10 μl) consisted of 0.8 μg of MS2 RNA substrate, 0.0625 μg of MazF-mt3(His)6 or MazF-mt7(His)6, 32 μg CspA and 0.5 μl of RNase inhibitor (Roche) in 10 mM Tris-HCl (pH 7.8). Primer extension was carried out at 47°C for 1 h in 20 μl of the reaction mixture as described previously (Zhang et al., 2005b). The reactions were stopped by adding 12 μl of sequence loading buffer (95% formamide, 20 mM EDTA, 0.05% bromphenol blue, and 0.05% xylene cyanol EF). The samples were incubated at 90°C for 5 min prior to electrophoresis on a 6% polyacrylamide and 36% urea gel. The primers used for primer extension analysis of the MS2 mRNA are listed in Table 5. The primers were 5′-labelled with [γ-32P]ATP using T4 polynucleotide kinase.

Table 5.  The primers used for primer extension analysis of the MS2 RNA.
Primer name | Sequence | Primer name | Sequence

Bioinformatic analysis of the frequencies of MazF-mt3 or -mt7 motifs in M. tuberculosis coding sequences

We retrieved the genomic sequence of M. tuberculosis H37Rv from NCBI RefSeq (Accession NC_000962), and extracted all coding sequences (CDS) from the record. We first calculated the nucleotide composition of each CDS. The probability, p, of either of the two motifs CUCCU or UUCCU appearing at any position in the CDS is:


Let L be the length of the CDS. Then the expected number, E, of the motifs in the CDS is:


Let K be the actual number of the motifs in the CDS. Then the probability, P, of having K or fewer motifs in the CDS is:


A very small P-value for a CDS suggests that the CDS may have evolved to eliminate the two motifs from its sequence.
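One way to make the definitions of p, E and P concrete is sketched below. It is a reading of the description rather than the exact formulas used in the analysis: it assumes that bases are independent with the frequencies of the CDS itself, that the number of possible motif positions is L − 4, and that the count of motifs follows a binomial distribution. All function names are introduced here for illustration.

```python
from math import comb

MOTIFS = ("CUCCU", "UUCCU")  # MazF-mt3 target pentads


def motif_probability(cds: str, motifs=MOTIFS) -> float:
    """p: chance that a given position starts one of the motifs,
    assuming independent bases with the CDS's own composition."""
    cds = cds.upper().replace("T", "U")
    freqs = {base: cds.count(base) / len(cds) for base in "ACGU"}
    p = 0.0
    for motif in motifs:
        term = 1.0
        for base in motif:
            term *= freqs[base]
        p += term  # the two motifs cannot start at the same position
    return p


def count_motifs(cds: str, motifs=MOTIFS) -> int:
    """K: observed number of motif occurrences (overlaps allowed)."""
    cds = cds.upper().replace("T", "U")
    return sum(cds.startswith(m, i) for i in range(len(cds)) for m in motifs)


def underrepresentation(cds: str, motifs=MOTIFS):
    """Return (E, K, P): expected count, observed count, and the
    binomial probability of observing K or fewer motifs."""
    p = motif_probability(cds, motifs)
    n = len(cds) - len(motifs[0]) + 1  # assumed number of positions, L - 4
    expected = p * n
    k_obs = count_motifs(cds, motifs)
    tail = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k_obs + 1))
    return expected, k_obs, tail


# Hypothetical 40-base sequence with no motif occurrences.
print(underrepresentation("ACGU" * 10))
```

For a real CDS of a kilobase or more, a P-value below the 0.01 or 0.02 cut-offs used in the Results would flag the gene as depleted in the motifs under this model.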

Bioinformatic analysis of the overrepresentation of the PE family among the ORFs where the numbers of MazF-mt3 cleavage sites are less than expected

The NCBI RefSeq record shows that M. tuberculosis has 3989 ORFs. The PE and PPE families together contain around 168 genes. From the preceding analysis, we identified a set of genes in which the numbers of cleavage sites are significantly lower than expected. Let n be the number of such significant genes, and let m be the number of PE and PPE genes among these n genes. Then the significance of the overrepresentation of the PE and PPE families follows the hypergeometric distribution:
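As a minimal sketch of this test, the code below computes the probability of finding m or more PE/PPE genes among n genes drawn without replacement from 3989 ORFs, of which 168 are taken to be PE/PPE genes. Taking the upper tail (m or more) is an assumption about how the significance was defined, and the function name is hypothetical.

```python
from math import comb

TOTAL_ORFS = 3989    # number of ORFs given in the text
PE_PPE_GENES = 168   # "around 168" PE and PPE genes, from the text


def hypergeom_upper_tail(n: int, m: int,
                         total: int = TOTAL_ORFS,
                         family: int = PE_PPE_GENES) -> float:
    """P(X >= m) when n genes are sampled without replacement and
    X counts how many of them belong to the family."""
    upper = min(n, family)
    favourable = sum(
        comb(family, k) * comb(total - family, n - k)
        for k in range(m, upper + 1)
    )
    return favourable / comb(total, n)


# The two cases discussed in the Results.
print(hypergeom_upper_tail(10, 4))
print(hypergeom_upper_tail(20, 7))
```

Under these assumptions both values fall below the thresholds quoted in the Results (0.0006 for n = 10, m = 4 and 0.00002 for n = 20, m = 7).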



This work was partially supported by a research fund from Takara-Bio. Inc. Japan to IM and by a grant from the National Institute of Allergy and Infectious Disease to RNH.