Preliminary analysis and annotation of the partial genome sequence of Francisella tularensis strain Schu 4

Authors


Titball, Defence Evaluation and Research Agency, CBD Porton Down, Salisbury, Wilts., SP4 0JQ, UK (e-mail: rtitball@dera.gov.uk).

1. Summary

2. Introduction

3. Materials and methods

3.1. Sequencing and assembly

3.2. Sequence analysis

3.3. Gap closure strategy

4. Results

4.1. Selection of strain for genome sequencing

4.2. Initial assembly

4.3. Confirmation of authenticity of library

4.4. Preliminary annotation

4.5. Amino acid biosynthesis

4.6. Purine biosynthesis

5. Discussion

6. References

1. SUMMARY

Francisella tularensis, the aetiological agent of tularemia, is an important pathogen throughout much of the Northern hemisphere. We have carried out sample sequencing of its genome in order to gain a greater insight into this organism about which very little is known, especially at the genetic level. Nucleotide sequence data from a genomic DNA shotgun library of the virulent F. tularensis strain Schu 4 has been partially assembled to provide 1·83 Mb of the genome sequence. A preliminary analysis of the F. tularensis genome sequence has been performed and the data compared with 20 fully sequenced and annotated bacterial genomes. Plasmid-encoded genes, previously isolated from low virulence strains of F. tularensis, were not identified. A total of 1289 potential coding ORFs were identified in the data set. An analysis of this data revealed 413 ORFs which would encode proteins with no homology to known proteins. ORFs which could encode proteins involved in amino acid and purine biosynthesis were also identified. These biosynthetic pathways provide targets for the construction of a defined attenuated mutant of F. tularensis for use as a vaccine against tularemia.

2. INTRODUCTION

Francisella tularensis, a small, Gram-negative facultative intracellular coccobacillus, is the aetiological agent of tularemia. Francisella tularensis is a member of the genus Francisella and is in the gamma subdivision of the Proteobacteria. Four subspecies of F. tularensis have been identified based on their biochemical characteristics and virulence. Francisella tularensis subspecies tularensis (also known as type A) and subspecies holarctica (also known as type B) are most frequently associated with disease in humans (Tärnvik 1989). The Francisellaceae family also includes various symbionts of ticks, including Wolbachia persica (Forsman et al. 1994; Noda et al. 1997) and an endosymbiont of the wood tick Dermacentor andersoni (Niebylski et al. 1997).

Tularemia is a zoonotic disease occurring throughout most of the Northern hemisphere (Tärnvik 1989). Rodents are thought to be the main reservoir of the bacterium, with ticks as the main vectors and also a possible reservoir (Hubalek and Halouzka 1997). Two clinical forms of the disease predominate: ulceroglandular (or glandular) tularemia is vector-borne or caused by contact with an infected animal, while pneumonic tularemia is a consequence of the inhalation of dust contaminated with the bacterium (Ericsson et al. 1997). Another form of disease, intestinal tularemia, is a result of the ingestion of infected foodstuffs (Stewart 1996) or water containing the bacteria (Tärnvik et al. 1996; Berdal et al. 2000). Francisella tularensis is a highly infectious pathogen, with as few as 10 organisms capable of causing disease in humans (Golovliov et al. 1997). Tularemia can be treated with antibiotics; streptomycin, gentamicin and tetracycline are currently considered the antimicrobials of choice (Limaye and Hooper 1999). If not treated, type A tularemia is associated with significant mortality, being fatal in approximately 10% of cases of ulceroglandular disease and 80% of cases of pneumonic disease.

Immunospecific protection against tularemia can be induced either by natural infection or by vaccination with the live vaccine strain of F. tularensis (LVS), whereas killed vaccines reportedly induce no protection against disease (Tärnvik et al. 1992). The origin of the LVS strain is uncertain. In 1956, a mixture of attenuated strains was transferred from the Soviet Union to the USA and from these strains the LVS was selected (Tärnvik et al. 1992). Studies have shown that immunization with LVS reduces the incidence of respiratory tularemia in at-risk employees, but not the incidence of ulceroglandular tularemia (Burke 1977). Further limitations of LVS include the fact that the molecular basis of its attenuation is unknown and also that the conditions used to culture the vaccine can influence the degree of attenuation (Cherwonogrodzky et al. 1994). Against this background there is a need for an improved vaccine against tularemia.

Little is known about the genetic makeup of F. tularensis or how it causes disease. The study reported here, involving the sample sequencing of a F. tularensis strain Schu 4 genomic library, aims to rectify this lack of knowledge and to provide the information needed to help develop a defined vaccine, either in the form of an attenuated mutant or a subunit vaccine.

3. MATERIALS AND METHODS

3.1. Sequencing and assembly

Francisella tularensis was cultured and DNA was isolated as described previously (Karlsson et al. 2000). The sequence was generated with a shotgun strategy (Fleischmann et al. 1995). The shotgun library consisted of 1·5–2·0 kb genomic DNA fragments cloned into pUC18. The sequencing was performed using dRhodamine or BigDye terminators on ABI Prism 377 and 373 DNA sequencers.

After 13 904 shotgun sequences were obtained, giving a statistical genome coverage of approximately five-fold, a first assembly was carried out using the PHRED/PHRAP base-caller and assembly programs (Ewing and Green 1998; http://bozeman.mbt.Washington.edu/phrap.docs/phrap.html) applying default parameters.

3.2. Sequence analysis

Contigs from the initial assembly were used to create a local database, which was searched for specific genes by probing with known protein sequences from other bacteria (obtained from GenBank) using TBLASTN (Altschul et al. 1997). Some contigs were analysed in detail with DNAStar software (Lasergene, Madison, WI).

For analysis of the genomic data, open reading frames (ORFs) were identified and the deduced amino acid sequences were compared with the Swissprot and Genpept protein databases using BLASTP. To identify possible coding regions missed because of low quality sequence and short or interrupted ORFs, the genomic sequence was split into 1 kb parts and compared with Genpept using BLASTX. Metabolic pathways were analysed with reference to the Kyoto Encyclopedia of Genes and Genomes (http://www.genome.ad.jp/kegg/; Ogata et al. 1999).
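The two steps described above (ORF identification and splitting the sequence into 1 kb parts) can be illustrated with a minimal sketch. This is not the software used in the study: the function names, the ATG-to-stop ORF definition and the `min_codons` threshold are assumptions made purely for illustration, and the resulting chunks or ORFs would still have to be passed to BLASTX or BLASTP separately.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def split_into_chunks(sequence, chunk_size=1000):
    """Split a nucleotide sequence into consecutive, non-overlapping parts.

    The final part may be shorter than chunk_size.
    """
    return [sequence[i:i + chunk_size] for i in range(0, len(sequence), chunk_size)]


def reverse_complement(sequence):
    """Return the reverse complement of an upper-case A/C/G/T sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


def find_orfs(sequence, min_codons=100):
    """Scan all six reading frames for simple ATG-to-stop ORFs.

    Returns (strand, frame, start, end) tuples. Coordinates are 0-based,
    end-exclusive, and refer to the strand that was scanned. The ORF
    definition and the length cut-off are illustrative assumptions.
    """
    sequence = sequence.upper()
    orfs = []
    for strand, seq in (("+", sequence), ("-", reverse_complement(sequence))):
        for frame in range(3):
            start = None
            for pos in range(frame, len(seq) - 2, 3):
                codon = seq[pos:pos + 3]
                if start is None and codon == "ATG":
                    start = pos  # first start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (pos - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, pos + 3))
                    start = None  # look for the next ORF in this frame
    return orfs
```

A contig of 2500 bp would yield three parts of 1000, 1000 and 500 bp. Frameshift errors in unfinished sequence break ORFs found by a scan of this kind, which is why a chunk-based BLASTX comparison is a useful complement.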

3.3. Gap closure strategy

Information from forward and reverse shotgun reads was linked to enable gaps between contigs, covered by a single clone, to be closed by primer walking. Closure of some gaps was also achieved by combinatorial PCR and prediction of flanking regions for the contigs using conserved gene order in other bacteria.
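The read-linking step can be sketched as follows. The data structures (`read_to_contig`, `clone_reads`) and the function name are hypothetical and stand in for whatever the assembly software reports; the sketch only shows the logic of finding clones whose forward and reverse reads fall in different contigs, which makes them candidates for primer walking.

```python
from collections import defaultdict


def find_bridging_clones(read_to_contig, clone_reads):
    """Group clones by the pair of contigs that their two end reads link.

    read_to_contig: dict mapping read id -> contig id (assembled reads only).
    clone_reads: dict mapping clone id -> (forward read id, reverse read id).
    Returns a dict mapping a sorted (contig, contig) pair -> list of clone ids.
    """
    links = defaultdict(list)
    for clone, (forward, reverse) in clone_reads.items():
        contig_f = read_to_contig.get(forward)
        contig_r = read_to_contig.get(reverse)
        if contig_f is None or contig_r is None:
            continue  # one end was not sequenced or not assembled
        if contig_f != contig_r:
            # The clone insert spans the gap between two contigs.
            links[tuple(sorted((contig_f, contig_r)))].append(clone)
    return dict(links)


if __name__ == "__main__":
    reads = {"c1.f": "contig1", "c1.r": "contig2",
             "c2.f": "contig1", "c2.r": "contig1"}
    clones = {"c1": ("c1.f", "c1.r"), "c2": ("c2.f", "c2.r")}
    print(find_bridging_clones(reads, clones))
```

In this toy input only clone `c1` bridges two contigs; clone `c2` lies entirely within one contig and is ignored.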

4. RESULTS

4.1. Selection of strain for genome sequencing

Francisella tularensis strain Schu 4 was chosen for sequencing as it belongs to the most virulent subspecies tularensis, is a clinical isolate with a documented history and is virulent in a suitable animal model. Strain Schu 4 was originally isolated from a human case of tularemia (Eigelsbach et al. 1951) and recently a clonal isolate of the bacterium has been shown to have a median lethal dose (MLD) in the murine model of disease of less than 1 cfu (Russell et al. 1998). DNA prepared from F. tularensis strain Schu 4 was used to construct a random fragment library in the plasmid pUC18.

4.2. Initial assembly

Sample sequencing of this library provided 13 904 nucleotide sequence reads (average read length 740 bp; Karlsson et al. 2000). In total, this sequence provided an estimated five-fold coverage of the F. tularensis genome, which is estimated to be 1·8–2·0 Mb (Karlsson et al. 2000). The initial assembly produced 353 contigs with an average length of 5·2 kb, giving in total 1·83 Mb of nucleotide sequence data with a G + C content of 33·2%. At this stage, confirmation of authenticity of the library was carried out, as well as analysis of the genomic sequence data for possible biosynthetic pathways to interrupt in order to construct a rationally attenuated auxotrophic vaccine.
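The coverage and assembly figures quoted above can be checked with a short calculation. The sketch below uses only the numbers given in the text; it computes raw coverage from the mean read length, so it is an upper bound that ignores trimming of low quality or vector sequence, which may be why the stated coverage is "approximately five-fold".

```python
def fold_coverage(n_reads, mean_read_length_bp, genome_size_bp):
    """Raw statistical coverage: total bases sequenced / genome size."""
    return n_reads * mean_read_length_bp / genome_size_bp


N_READS = 13_904          # shotgun reads
MEAN_READ_BP = 740        # average read length
TOTAL_BASES = N_READS * MEAN_READ_BP

# Estimated genome size range of 1.8-2.0 Mb.
COVERAGE_LOW = fold_coverage(N_READS, MEAN_READ_BP, 2.0e6)
COVERAGE_HIGH = fold_coverage(N_READS, MEAN_READ_BP, 1.8e6)

# 353 contigs with an average length of 5.2 kb.
ASSEMBLED_MB = 353 * 5.2 / 1000

if __name__ == "__main__":
    print(f"total bases sequenced: {TOTAL_BASES / 1e6:.2f} Mb")
    print(f"raw coverage: {COVERAGE_LOW:.1f}- to {COVERAGE_HIGH:.1f}-fold")
    print(f"assembled sequence: {ASSEMBLED_MB:.2f} Mb")
```

The product of contig number and rounded average contig length gives about 1.84 Mb, in agreement with the reported 1·83 Mb once rounding of the average length is taken into account.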

4.3. Confirmation of authenticity of library

All F. tularensis proteins whose sequences had previously been deposited with NCBI/GenBank were searched for in the genomic data using BLASTX. Of the 34 reported complete F. tularensis proteins, genes which could encode 25 were identified. The remaining nine genes were not identified in the data set (Table 1). All nine are reported to be located on either plasmid pOM1, which was isolated from F. tularensis LVS (GenBank accession AF055345), or plasmid pNFL10, which was isolated from F. tularensis var novicida.

Table 1. Previously reported F. tularensis genes not identified in F. tularensis strain Schu 4 genomic data

4.4. Preliminary annotation

A preliminary analysis of the whole genome was carried out and a comparison of the genes of various bacteria with those found in the F. tularensis genome is available at http://www.medmicro.mds.qmw.ac.uk/ft/. The average G + C content within putative ORFs was higher (by ≥ 0·6%) than the G + C content of the entire available data set (33·8%). Within the putative ORFs, the G + C content at the three codon positions had a distinct signature (Table 2). In total 1804 candidate ORFs were identified in the data set of which 1289 were thought to encode proteins (Table 2). This analysis also indicated that on average there was one predicted coding ORF for every 1·42 kb of the genome. In contrast, for the other 20 bacterial genomes surveyed (Table 3) on average one predicted coding ORF was present for every 0·93 kb of the genome.

Table 2. Preliminary analysis of ORFs in F. tularensis strain Schu 4

Table 3. Number of ORFs in 20 completed bacterial genomes (compiled from the Comprehensive Microbial Resource at The Institute for Genomic Research, http://www.tigr.org/)
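The ORF density quoted in the text follows directly from the size of the data set and the number of predicted coding ORFs, and the G + C content at each codon position can be computed by sampling every third base of an ORF. The sketch below is illustrative only: the function names are assumptions, the toy sequence is not F. tularensis data, and the 32% figure is derived here from the numbers in the Summary rather than stated in the paper.

```python
def gc_fraction(sequence):
    """Fraction of G and C bases in a nucleotide sequence (0.0 if empty)."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return sum(base in "GC" for base in sequence) / len(sequence)


def gc_by_codon_position(orf):
    """G + C fraction at codon positions 1, 2 and 3 of an in-frame ORF."""
    return tuple(gc_fraction(orf[i::3]) for i in range(3))


ASSEMBLED_KB = 1830       # 1.83 Mb of assembled sequence
CODING_ORFS = 1289        # ORFs thought to encode proteins
NO_HOMOLOGY_ORFS = 413    # ORFs with no homology to known proteins

KB_PER_ORF = ASSEMBLED_KB / CODING_ORFS
PERCENT_NO_HOMOLOGY = 100 * NO_HOMOLOGY_ORFS / CODING_ORFS

if __name__ == "__main__":
    print(f"one coding ORF per {KB_PER_ORF:.2f} kb")
    print(f"ORFs without homology: {PERCENT_NO_HOMOLOGY:.0f}%")
    print(gc_by_codon_position("ATGGCCAAA"))  # toy ORF: ATG GCC AAA
```

The density of one coding ORF per 1·42 kb is reproduced, and roughly one third of the predicted coding ORFs fall into the class with no database match.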

The putative proteins encoded by the F. tularensis sequence data were assigned to one of 16 different categories, to enable a comparison with other bacteria whose genomes have previously been sequenced. Data on the categorization of proteins for 20 sequenced bacterial genomes (Table 3) were obtained from the Comprehensive Microbial Resource at The Institute for Genomic Research (http://www.tigr.org/) and compared with the categorization of F. tularensis putative proteins (Fig. 1). The average genome size of these 20 surveyed bacteria is 2 Mb, which should allow meaningful comparisons to be drawn with the F. tularensis genome data.

Figure 1. Preliminary annotation of the F. tularensis strain Schu 4 genome sequence; comparison of the number of F. tularensis genes in 15 functional categories (solid bars). The mean number of genes in these categories in 20 other bacterial species is shown as open bars, with error bars indicating the lowest and highest numbers of genes in each category

This analysis showed that F. tularensis Schu 4 has a typical number of genes which encode putative proteins required for translation, transcription and DNA replication. Genes which could encode purine and pyrimidine biosynthesis proteins and cofactor biosynthesis proteins were also represented at an expected level. In contrast, the number of putative proteins in F. tularensis Schu 4 with no match in the GenBank database was higher than that reported in any of the other microbial genomes which were included in this comparison. Conversely, the number of ORFs which could encode proteins with a match to other bacterial hypothetical proteins was lower than in the other microbial genomes. Putative proteins which play a role in transport/binding, gene regulation, energy metabolism and cellular processes were also under-represented in the data set.

4.5. Amino acid biosynthesis

In comparison with the other bacterial genomes analysed, ORFs which could encode proteins associated with amino acid biosynthesis were represented at an above expected level in F. tularensis Schu 4. One subset of these putative proteins, those required for aromatic amino acid biosynthesis, was analysed in more detail. Using the genomic sequence data, the shikimate pathway was elucidated for F. tularensis (Fig. 2; Karlsson et al. 2000). Homologues of all the necessary enzymes for the pathway to be functional were identified in F. tularensis Schu 4. To confirm that the shikimate pathway was functional, F. tularensis Schu 4 was cultured in defined media (Karlsson et al. 2000), with and without added tyrosine. The bacteria were able to grow in both media (data not shown).

Figure 2. Shikimate pathway in F. tularensis Schu 4, reconstructed from genome sequence data. Genes identified in F. tularensis Schu 4 are shown in bold type. Alternative enzymes which are able to catalyse this step in other bacterial species are also shown

Escherichia coli possesses three structurally related phospho-2-dehydro-3-deoxyheptonate aldolase enzymes (AroF, AroG and AroH, Fig. 2) capable of catalysing the first step in the biosynthetic pathway. Expression of the encoding genes is repressible by tyrosine, phenylalanine or tryptophan, respectively (Kwok et al. 1995). For the fifth step in the pathway, shikimate to shikimate-5-phosphate, E. coli possesses two shikimate kinases (AroK and AroL) capable of catalysing the reaction. Our analysis of the F. tularensis sequence data indicates only one phospho-2-dehydro-3-deoxyheptonate aldolase is present (AroG), and one shikimate kinase (AroK), as is the case in Haemophilus influenzae.
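The comparison of isoenzymes can be expressed as a small lookup. The sketch below encodes only the two pathway steps for which the text names the enzymes; the dictionary layout and the function name are assumptions for illustration. A step served by a single gene is a candidate single point of failure, although this alone does not show that inactivating the gene would be attenuating.

```python
# Enzymes named in the text for two steps of the shikimate pathway.
ISOENZYMES = {
    "E. coli": {
        "phospho-2-dehydro-3-deoxyheptonate aldolase (step 1)": ["AroF", "AroG", "AroH"],
        "shikimate kinase (step 5)": ["AroK", "AroL"],
    },
    "F. tularensis Schu 4": {
        "phospho-2-dehydro-3-deoxyheptonate aldolase (step 1)": ["AroG"],
        "shikimate kinase (step 5)": ["AroK"],
    },
}


def single_enzyme_steps(steps):
    """Return the pathway steps that are served by exactly one enzyme."""
    return [step for step, enzymes in steps.items() if len(enzymes) == 1]


if __name__ == "__main__":
    for organism, steps in ISOENZYMES.items():
        print(organism, "->", single_enzyme_steps(steps))
```

For E. coli the list is empty because both steps have alternative enzymes, whereas for F. tularensis Schu 4 both steps are returned.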

4.6. Purine biosynthesis

In comparison with the other bacterial genomes analysed here, an expected number of putative proteins required for purine and pyrimidine biosynthesis was identified in F. tularensis strain Schu 4 (Karlsson et al. 2000). All of the genes encoding proteins necessary for the de novo synthesis of purine nucleotides were identified (Fig. 3).

Figure 3. The purine biosynthesis pathway in F. tularensis Schu 4, reconstructed from genome sequence data. Genes identified in F. tularensis Schu 4 are shown in bold type

5. DISCUSSION

The partial genome sequence of F. tularensis has already provided a significant enhancement to our knowledge of the genetic makeup of this organism. Our sequence data indicate that the plasmids identified in low virulence strains of F. tularensis (pOM1 and pNFL10) are absent from the highly virulent Schu 4 strain. These plasmids therefore cannot contain essential virulence genes. However, we are not yet able to determine whether any other extrachromosomal DNA elements are present in strain Schu 4.

Our preliminary analysis indicates that in F. tularensis Schu 4, the number of coding ORFs for each kb of sequence data was lower than that seen in most of the other bacteria analysed here. It is possible that this reflects the fact that the sequence is not yet fully assembled and edited, so that some ORFs may be unrecognized due to frameshift errors. Alternatively, the proteins encoded by F. tularensis Schu 4 may be, on average, larger than those found in other bacterial species. However, it is interesting to note that in the case of Rickettsia prowazekii, another intracellular pathogen, a high percentage (24%) of the genome is thought to be noncoding, containing spacer sequences and inactivated genes (Andersson et al. 1998).

In F. tularensis, a greater than expected number of ORFs appeared to encode proteins which showed no homology to proteins in any of the public sequence databases. One contributory factor could be the lack of existing genetic information for the Francisella genus; there are currently only 57 entries in the GenPept database for the whole group.

The number of ORFs predicted to encode purine, pyrimidine and amino acid biosynthesis enzymes in F. tularensis was also greater than the number of enzymes present in these pathways in the other bacterial species analysed here. The shikimate pathway is functional in many bacteria and is the common pathway for the biosynthesis of chorismate. Chorismate is the precursor for the generation of aromatic amino acids (phenylalanine, tryptophan and tyrosine), folate, coenzymes, vitamins (such as the benzenoid and naphthenoid coenzymes Q and vitamin K) and a broad range of aromatic secondary metabolites, including the 2,3-dihydroxybenzoate-containing molecules required for iron uptake and accumulation (Walsh et al. 1990). The fact that all of the genes necessary to encode the shikimate pathway enzymes are present in the genome of F. tularensis strain Schu 4, combined with the bacterium’s ability to grow in a defined medium lacking aromatic amino acids, indicates that this pathway is functional (Karlsson et al. 2000).

Although the inactivation of genes in the shikimate pathway has resulted in attenuation of many pathogens, including Salmonella typhimurium (Hoiseth and Stocker 1981), Salmonella typhi (Stocker 1988), Neisseria gonorrhoeae (Chamberlain et al. 1993), Aeromonas salmonicida (Vaughan et al. 1993) and Pasteurella multocida (Homchampa et al. 1992), the mechanism of attenuation has not been fully defined. Our finding that, for each step of the shikimate pathway, only one enzyme capable of catalysing the reaction appears to be encoded in the F. tularensis genome suggests that each gene in this pathway is a potential target for mutation to stop the synthesis of chorismate.

Blocking the de novo synthesis of purines has also been shown to attenuate several pathogens, including Yersinia pestis (Brubaker 1970), Bacillus anthracis (Ivanovics et al. 1968), Salmonella dublin and Salm. typhimurium (McFarland and Stocker 1987). These studies have shown that the level of attenuation achieved varies depending on the position at which the pathway is blocked. The degree of attenuation also differs between species, with mutations at a given position in the pathway being more attenuating in one species than in another. Sequencing the genome of F. tularensis has identified all of the genes necessary for the de novo synthesis of purines in this organism, which will aid our strategy for the construction of purine auxotrophs.

An alternative to the construction of a rationally attenuated auxotrophic vaccine is the development of a subunit vaccine. Various surface antigens of F. tularensis have been evaluated as components of a subunit vaccine (Sjöstedt et al. 1992; Fulop et al. 1995; Golovliov et al. 1995). However, these studies have not led to the identification of a protective antigen. With the availability of the genome sequence data for F. tularensis Schu 4 it is now possible to take a more systematic approach to the identification of protective antigens, as the genes encoding candidates have been identified. Computational methods can be used to determine, with some degree of confidence, a protein’s cellular localization. All putative proteins derived from the genome sequence could therefore be screened for those with a cellular localization spanning from the inner membrane to outside the bacterium, narrowing the number of candidates for recombinant protein production and subsequent protection studies. This methodology has been used to identify vaccine candidates from the genome sequence of Neisseria meningitidis (Pizza et al. 2000) and a similar method of computational screening has been used to identify secreted proteins from the genome sequence of Mycobacterium tuberculosis (Gomez et al. 2000).
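The screening step proposed above amounts to a filter on predicted localization. The sketch below assumes that each putative protein has already been given a localization label by some external prediction tool, which is not modelled here; the compartment labels, the `PutativeProtein` class and the ORF identifiers are all hypothetical.

```python
from dataclasses import dataclass

# Compartments ordered from the cytoplasm outwards (labels are assumptions).
COMPARTMENTS = ["cytoplasm", "inner membrane", "periplasm",
                "outer membrane", "extracellular"]


@dataclass
class PutativeProtein:
    orf_id: str
    localization: str  # label assigned by an external predictor


def surface_candidates(proteins):
    """Keep proteins predicted to lie from the inner membrane outwards."""
    first = COMPARTMENTS.index("inner membrane")
    wanted = set(COMPARTMENTS[first:])
    return [p for p in proteins if p.localization in wanted]


if __name__ == "__main__":
    proteins = [
        PutativeProtein("orf0001", "cytoplasm"),
        PutativeProtein("orf0002", "outer membrane"),
        PutativeProtein("orf0003", "extracellular"),
    ]
    print([p.orf_id for p in surface_candidates(proteins)])
```

Only the non-cytoplasmic entries are retained as candidates for recombinant production; the reliability of such a list depends entirely on the accuracy of the localization predictions supplied to it.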
