Contribution of phage-derived genomic islands to the virulence of facultative bacterial pathogens


  • Ben Busby,

    Corresponding author
    • National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
    • These authors contributed equally to this work.
  • David M. Kristensen,

    1. National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
    • These authors contributed equally to this work.
  • Eugene V. Koonin

    1. National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA

For correspondence. E-mail; Tel. (301) 594 2698; Fax (301) 480 0298.


Facultative pathogens have extremely dynamic pan-genomes, to a large extent derived from bacteriophages and other mobile elements. We developed a simple approach to identify phage-derived genomic islands and applied it to show that pathogens from diverse bacterial genera are significantly enriched in clustered phage-derived genes compared with related benign strains. These findings show that genome expansion by integration of prophages containing virulence factors is a major route of evolution of facultative bacterial pathogens.

Most of the known bacterial pathogens have closely related benign counterparts among environmental organisms. What makes a facultative pathogen different from its non-pathogenic relatives? The answer is complex and involves at least two major evolutionary routes. One is the reductive path whereby benign bacteria lose certain metabolic genes, which makes them dependent on their host for at least some essential nutrients. This path has been investigated in much detail and the current view holds that the environmental strains first convert to an obligate host-dependent lifestyle and then, during the massive genome reduction that follows, lose regulatory genes and become pathogenic (Andersson and Kurland, 1998; Moran and Mira, 2001; Vissa and Brennan, 2001; Merhej et al., 2009; Clarke, 2011; Merhej and Raoult, 2011; Belda et al., 2012). Although this model might explain the origin of many obligate pathogens, it cannot be universal, as many bacterial pathogens retain the ability to live freely outside of their hosts, and their genome sizes actually tend to increase during the conversion to pathogenicity (Ho Sui et al., 2009; Botzman and Margalit, 2011).

For these facultative pathogens, the degree of damage inflicted on the host is ultimately determined by the interplay of the organism's genetic make-up and the environmental cues given by the host (Casadevall and Pirofski, 2000). A critically important role is played not by the core set of housekeeping genes but by so-called virulence factors, often encoded by foreign genes, horizontally acquired from other bacteria, and/or integrative elements such as prophages or plasmids. In many environments, phages far outnumber their host bacteria, potentially making the phage contribution to the evolution of the pathogen genomes particularly important. It has been observed that horizontal gene transfer is much more extensive in the human gut (an extremely ‘crowded’ habitat) than in other environments characterized to date (Smillie et al., 2011), although the contribution of prophages to this transfer is likely underrated (Reyes et al., 2012). Horizontally acquired genes often cluster together in bacterial genomes, resulting in genomic islands that differ from the surrounding genes in their GC content, dinucleotide frequencies, codon usage and other characteristics (Hacker and Kaper, 2000; Guy, 2006; Soares et al., 2012).

Over a decade ago sequencing and comparison of complete genomes from pathogenic and laboratory strains of Escherichia coli revealed that the major differences between these strains were due to the insertion of prophages and other mobile elements, many of which have since undergone extensive decay (Burland et al., 1998; Perna et al., 1998; 2001; Ohnishi et al., 1999; Plunkett et al., 1999; Hayashi et al., 2001; Welch et al., 2002; Asadulghani et al., 2009). Since that time, comparative analysis of the vastly increased collection of complete bacterial genomes has demonstrated that accretion of pathogenicity islands (a type of genomic island found primarily in pathogens) is a general mechanism by which many bacteria undergo conversion to pathogenicity (e.g. Hensel and Schmidt, 2008; Mohammadi-Barzelighi et al., 2011; Trost et al., 2012). Importantly, the dynamics of island accretion is much faster than the dynamics of either speciation or of changes in the core bacterial genome (Castillo-Ramirez et al., 2011). Moreover, certain pathogenic strains of E. coli appear to have acquired their prophage insertions independently (Reid et al., 2000), with several insertions specific to different isolates of O157:H7 (Eppinger et al., 2011).

After integration, prophages and other mobile elements undergo extensive decay, making recognition of the genomic (pathogenicity) islands a technically challenging task. Also, the genomes of phages harbour the greatest amount of genetic diversity on the planet, with an estimated 90% still remaining unknown, which limits the utility of approaches that rely on similarity to known genes (Angly et al., 2006). To evaluate the contribution of pathogenicity islands, specifically intact and degraded prophages, to the virulence of facultative bacterial pathogens, we compared the characteristics of islands detected by several independent methods of prophage detection (Fouts, 2006; Leplae et al., 2010; Akhter et al., 2012). We also employed the recently updated collection of Phage Orthologous Groups (POGs, available at which represent a collection of evolutionarily conserved gene families observed in at least three distinct complete phage genomes (Kristensen et al., 2011) to detect clusters of genes transferred from phages to bacteria.

Figure 1 shows the mapping of genes homologous to POGs [identified by profile psi-blast (Altschul et al., 1997) searches] within the genomes of two of the first E. coli strains to be sequenced, the pathogenic O157:H7 Sakai (Watanabe et al., 1999), and the benign laboratory strain K-12 MG1655 (Blattner et al., 1997). Clusters of phage genes that likely represent inserted prophage regions are shown in pink, and prophages predicted by several other methods are shown for comparison, including (inner boxes, in black) known prophages identified by manual inspection of the genomes (Hayashi et al., 2001; Asadulghani et al., 2009). The regions highlighted in purple are those of the completely automated heuristic program Phage_Finder (Fouts, 2006); in brown are the Prophinder results available in the ACLAME database (Leplae et al., 2010), which relies on similarity to known phage genes and examination of integration sites and performs automated genome analysis followed by manual curation; and in blue are the results from the newer PhiSpy tool (Akhter et al., 2012) that combines similarity- and composition-based strategies.

Figure 1.

Intact and degraded prophage regions identified by several approaches in the pathogenic Escherichia coli O157:H7 strain Sakai (left) and the benign laboratory strain K-12 MG1655 (right).

Phage genes that appear in clusters are more likely to have arisen by insertion of a prophage than isolated phage gene homologues, even if much of the prophage has since degraded. Even without using sophisticated techniques employed by many prophage identification methods (e.g. examining integration sites etc.), clusters of POGs manage to reproduce the general locations of all of the known prophages in O157:H7 Sakai, with most clusters containing at least one phage-specific gene (phage structural proteins, terminase, integrase etc.). Good agreement is also found with known prophages in K-12 except for the cryptic CP4 regions (which might not be prophage-derived) as well as one additional region previously annotated as a ‘prophage-like’ region (Asadulghani et al., 2009). It has been shown that many of the prophages in O157:H7 Sakai, despite having obvious defects, retained the ability to be excised from the host genome, replicate, have their DNA packaged, be released from the cell, and in a few cases even productively infect other bacteria, often with the help of other prophages present in the same cell, thus leading to an overall potentiation of horizontal gene transfer (Asadulghani et al., 2009). Thus, many of the phage gene clusters detected by the POG-based approach are likely to contribute to the evolution of pathogenicity islands even if they are not complete prophages.

Obviously, the pathogenic O157:H7 Sakai genome contains many more clusters of phage genes as well as genes encoding virulence factors associated with phage genes than the genome of the benign laboratory strain K-12 (Fig. 1). We defined a set of phage-associated virulence factors by performing literature searches for each POG, looking for proteins whose differential expression (mostly knockout) had a minimal effect on growth in other environments but measurably affected pathogenicity (Brown et al., 2012) (see Table S1). Acquisition of pathogenicity islands seems to be a general trend among the pathogenic E. coli strains, and among pathogens from other bacterial phyla as well, often bringing additional secretion systems (such as the Type III system that is rarely observed in non-pathogenic E. coli), toxins, adhesins, invasins, modulins, and many other virulence genes that modulate host cell activities, such as providing pathogens with the ability to invade the epithelial cell layer (Hacker and Kaper, 2000).

To investigate this trend of acquisition of pathogenicity islands in diverse bacteria, we compiled a collection of 107 completely sequenced genomes from organisms that retain the capability to live freely outside of their hosts, and represent four diverse genera from two different phyla, namely 27 Escherichia and 21 Pseudomonas (both Gammaproteobacteria), 21 Burkholderia (Betaproteobacteria) and 38 Bacillus (Firmicutes) strains. This data set includes roughly equal numbers of pathogens and non-pathogens (53 and 54 respectively), and represents several diverse examples of relatively recent and parallel conversions to pathogenicity. For each organism, pathogenicity status was assigned based on the data available in KEGG, GOLD, BioProject, BacMap and Human Oral Microbiome databases (Barrett et al., 2012; Chen et al., 2010; Cruz et al., 2012; Kanehisa et al., 2012; Pagani et al., 2012; Sayers et al., 2012) or in the DOE Joint Genome Institute Portal (Grigoriev et al., 2012) and ascertained from literature searches for the specific strains involved (see Table S2). To identify phage-derived genes in these genera, we used the procedure outlined above. Clusters were defined as at least seven consecutive windows (a window is at least four POG matches in a succession of 10 genes), allowing for up to 16 genes without POG matches between them.
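The cluster definition above can be sketched in a few lines of Python. This is only one possible reading of the stated rule, not the implementation used for the analysis: it assumes that windows slide one gene at a time along the gene order, that qualifying windows whose gene spans lie no more than 16 genes apart belong to the same cluster, and that a cluster is reported from its first to its last POG-matching gene. The function name `find_phage_clusters` and its parameter names are hypothetical.

```python
from typing import List, Sequence, Tuple


def find_phage_clusters(
    pog_match: Sequence[bool],
    window_size: int = 10,   # "a succession of 10 genes"
    min_matches: int = 4,    # "at least four POG matches"
    min_windows: int = 7,    # "at least seven consecutive windows"
    max_gap: int = 16,       # "up to 16 genes without POG matches"
) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) gene indices of candidate clusters.

    pog_match[i] is True when gene i (in chromosomal order) matches a POG.
    The sliding step of one gene and the grouping rule are assumptions.
    """
    n = len(pog_match)

    # Start positions of all windows holding enough POG matches.
    hits = [
        i for i in range(n - window_size + 1)
        if sum(pog_match[i:i + window_size]) >= min_matches
    ]

    # Group qualifying windows whose gene spans are at most max_gap apart.
    groups: List[List[int]] = []
    for i in hits:
        if groups and i - groups[-1][-1] - window_size <= max_gap:
            groups[-1].append(i)
        else:
            groups.append([i])

    clusters = []
    for group in groups:
        if len(group) < min_windows:
            continue
        start, end = group[0], group[-1] + window_size - 1
        # Trim the span to the first and last POG-matching genes.
        while not pog_match[start]:
            start += 1
        while not pog_match[end]:
            end -= 1
        clusters.append((start, end))
    return clusters
```

Under this reading, four adjacent POG matches are the smallest stretch that yields seven qualifying windows, so isolated phage gene homologues are not reported as clusters.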

Figure 2 shows the comparison of several characteristics potentially related to pathogenicity in the four genera examined. Similar to O157:H7 Sakai, the genomes of the other 16 E. coli pathogens have significantly more (predicted) protein-coding genes on average than benign strains of Escherichia (nine E. coli and one E. fergusonii) (Fig. 2A). To a large extent, the additional genes in the pathogens are accounted for by phage-derived genes as measured by matches to POGs (Fig. S1). Most of these phage-derived genes are organized in clusters that correspond to integrated prophages or their decayed remnants (Fig. 1) that are considered to have been acquired independently by different E. coli strains (Reid et al., 2000). The same trend generally holds for pathogens from other genera although certain non-pathogenic Burkholderia spp. possess large genomes (Fig. 2A). Typically, POG profiles yield significantly more blast matches per genome for pathogens than for the related benign strains (Fig. 2B) even when normalized for genome size (Fig. 2C). Furthermore, the pathogen genomes show a stronger tendency for clustering of phage-derived genes (Fig. 2D and E), possibly from more recent insertion of prophages, and in three of the four genera, these clusters tend to be larger (Fig. 2F), compatible with the same conclusion. Finally, in some pathogens, these clusters are enriched in presumably phage-derived virulence factors (Fig. 2G and H). All the trends towards enrichment for phage-derived genes and (putative) prophages were far more pronounced for the pathogenic strains of Escherichia than for the bacteria from other genera (Fig. 2). Although it is impossible to rule out that evolution of pathogenicity in Escherichia differs from that in other bacteria, we find it much more likely that the stronger trends are a result of the far better characterization of phages that infect E. coli [21% of all sequenced phages with known hosts infect Escherichia, compared with < 10% for each of the other genera (Kristensen et al., 2011)]. Should that be the case, the results obtained for Escherichia should be considered better representative of the actual processes accompanying the evolution of pathogenicity in free-living bacteria than the results for the other genera.

Figure 2.

Significant differences between the pathogenic strains of four genera and the benign strains of the same species. For all panels, each two-column set represents (from left to right) Escherichia (27 genomes), Bacillus (38 genomes), Pseudomonas (21 genomes) and Burkholderia (21 genomes); pathogenic strains are in black, benign strains are in white. Asterisks indicate P-value < 0.05 by Student's or Welch's t-test, if the sets had equal or unequal variance respectively. The panels are as follows: (A) genes per genome (excluding plasmids); (B) POG matches per genome; (C) POG matches per gene; (D) clusters per genome; (E) per cent of POG matches in clusters; (F) cluster size; (G) (known) virulence factors per cluster; (H) per cent of (known) virulence factors in clusters.
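For readers unfamiliar with the two tests named in the legend, the following sketch shows how the Student's (pooled-variance) and Welch's (unequal-variance) statistics differ. It uses only the Python standard library and returns the t statistic and degrees of freedom; converting these to a P-value requires a t-distribution, which the sketch deliberately leaves out. The function names are hypothetical, and the legend does not state how equality of variance was judged, so that decision is left to the caller.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence, Tuple


def student_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Two-sample Student's t statistic and degrees of freedom.

    Assumes both samples share a common variance, which is pooled.
    """
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))
    return t, float(df)


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Welch's t statistic with Welch-Satterthwaite degrees of freedom.

    Makes no equal-variance assumption; each sample keeps its own variance.
    """
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df
```

Each function would be called with the per-genome values of one panel (for example, genes per genome) for the pathogenic and benign strains of a single genus. With equal sample sizes the two statistics coincide and only the degrees of freedom differ.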

The results of the present analysis indicate that genome expansion by integration of prophages containing virulence factors is a major route of evolution by which free-living bacteria convert from a benign state to pathogenicity. A striking example outside of the four genera analysed here is the normally commensal Neisseria meningitidis that, upon chromosomal integration of a phage, converts to a pathogenic form that can cross the blood–brain barrier and kill a previously healthy individual within hours (Bille et al., 2005). Indeed, it has been more than half a century now since it was first discovered that the integration of a prophage can convert previously benign Corynebacterium diphtheriae bacteria into a pathogenic form (Freeman, 1951; Uchida et al., 1971). Vibriophages carrying genes necessary for cholera epidemics represent yet another well-studied example (Sen and Ghosh, 2005). As genome sequencing inexorably continues, genomic data sets are getting too large and complex to use brute force algorithms, so simple and efficient approaches like the one applied here for phage gene detection are becoming especially useful for detecting trends across these large data sets.


The authors would like to thank Andrew Edwards, Sivakumar Kannan, Alexander Lobkovsky, Kira Makarova, Nobuto Takeuchi, Yuri Wolf and Natalya Yutin for helpful discussions. The authors' research is supported by the Intramural Research Program of the US National Institutes of Health at the National Library of Medicine.