Prophages and bacterial genomics: what have we learned so far?


E-mail; Tel. (+1) 801 581 5980; Fax (+1) 801 581 3607.


There is something fascinating about science. One gets such wholesale returns of conjecture out of such a trifling investment of fact.

Mark Twain 1883
Life on the Mississippi

Summary

Bacterial genome nucleotide sequences are being completed at a rapid and increasing rate. Integrated virus genomes (prophages) are common in such genomes. Fifty-one of the 82 such genomes published to date carry prophages, and these contain 230 recognizable putative prophages. Prophages can constitute as much as 10–20% of a bacterium's genome and are major contributors to differences between individuals within species. Many of these prophages appear to be defective and are in a state of mutational decay. Prophages, including defective ones, can contribute important biological properties to their bacterial hosts. Therefore, if we are to comprehend bacterial genomes fully, it is essential that we are able to recognize accurately and understand their prophages from nucleotide sequence analysis. Analysis of the evolution of prophages can shed light on the evolution of both bacteriophages and their hosts. Comparison of the Rac prophages in the sequenced genomes of three Escherichia coli strains and the Pnm prophages in two Neisseria meningitidis strains suggests that some prophages can lie in residence for very long times, perhaps millions of years, and that recombination events have occurred between related prophages that reside at different locations in a bacterium's genome. In addition, many genes in defective prophages remain functional, so a significant portion of the temperate bacteriophage gene pool resides in prophages.

Prophage biology

The genomes of cellular organisms are often littered with both functional and defunct viral chromosomes. For example, the human genome is about 8% retrovirus genes (Lander et al., 2001), and some bacterial genomes may be composed of as much as 20% bacteriophage genes (Casjens et al., 2000). Clearly, in order to understand these genomes completely, we must be able to recognize these viral genes and understand any effects they may have on the host cells.

Bacteriophages, the viruses that infect bacteria, are extremely varied. Different types of phage virions may carry single- or double-stranded (ds)DNA or RNA, and the details of their replication cycles reflect this diversity. The dsDNA phages, the subject of this review, can be grossly divided into lytic and temperate virus groups, each of which is extremely diverse. Lytic dsDNA phages infect bacterial cells and always programme the synthesis of progeny virions, which are then released from the dead, infected cell. Temperate dsDNA phages, on the other hand, although they are able to propagate lytically under some circumstances, are also able to establish a stable relationship with their host bacteria in which the phage DNA is replicated in concert with the host's chromosome, and virus genes that are detrimental to the host are not expressed. This long-term, apparently benign, association of bacteriophages with bacterial cells was first described in the 1920s (Gildmeister and Herzberg, 1924; Bail, 1925; Bordet, 1925), but its acceptance and an understanding of the real nature of this association took many years (Lwoff, 1953; 1966). Subsequent work has shown that, during this association, the phage DNA (now called the ‘prophage’) is usually physically integrated into one of the native replicons of the host (Campbell, 1962; Freifelder and Meselson, 1970); however, a few phages, such as P1, N15, LE1, φ20 and φBB-1, are not integrated and exist as circular or linear plasmids (Ikeda and Tomizowa, 1968; Ravin and Shulga, 1970; Inal and Karunakaran, 1996; Eggers et al., 2000; Girons et al., 2000). Different individuals of a given integrating temperate phage always have the same unique integration site on the phage chromosome, but may or may not always integrate their DNA at precisely the same site in the bacterial chromosome. 
In Escherichia coli, for example, phage λ DNA normally integrates at only one site, phage P2 DNA can quite readily integrate into at least 10 sites (Barreiro and Haggard-Ljungquist, 1992), and phage Mu DNA integrates essentially randomly into host DNA (Harshey, 1988). Bacteriophage virions can be released from cells containing an intact prophage by a process called induction, during which prophage genes required for lytic growth are turned on and progeny virions are produced and released from the cell. Cells carrying a prophage are called ‘lysogens’ because of this potential to induce and lyse. Induction can happen spontaneously and randomly in a small fraction of the bacteria that harbour a given prophage, or specific environmental signals can cause simultaneous induction of a particular prophage in many cells. A number of the important ‘model system’ dsDNA tailed phages were first discovered after they were released from lysogenic bacteria in the laboratory; for example, phages λ (Lederberg, 1951), P22 (Zinder and Lederberg, 1952), P1 and P2 from the same E. coli strain (Bertani, 1951), P4 (Six, 1963) and N15 (Ravin, 1968) were originally isolated in this manner. Most genes, including those required for lytic growth and virion production, are turned off in integrated prophages but, in the few studied cases, plasmid prophages typically express most of their non-lysis, non-virion assembly genes. Some of the genes that are expressed from the prophage in a lysogen are ‘lysogenic conversion’ genes, which alter the properties of the host bacterium. The products of these genes can have very important effects on the host bacterium, which range from protection against further phage infection to increasing the virulence of a pathogenic host. 
This subject has been frequently and recently reviewed and will not be covered in depth here (see Bishai and Murphy, 1988; Cheetham and Katz, 1995; Waldor, 1998; Miao and Miller, 1999; Boyd et al., 2001; Banks et al., 2002; Boyd and Brussow, 2002; Wagner and Waldor, 2002; Casjens and Hendrix, 2003). The presence or absence of prophages can account for a large fraction of the variation among individuals within a bacterial species, and phages are likely to be important vehicles for horizontal transfer of genetic information between bacteria (Ohnishi et al., 2001; Banks et al., 2002; Casjens and Hendrix, 2003). Clearly, in order fully to understand the information in bacterial whole-genome nucleotide sequences, it is essential that we be able to recognize and understand prophages when they are present. The medical and evolutionary importance of prophages makes this all the more urgent.

Types of prophages and related entities

Fully functional prophages can be induced to initiate a round of lytic growth; however, not all prophage-like entities in bacterial genomes encode functional bacteriophages. Four additional types of prophage-related entities have been characterized: defective and satellite prophages, bacteriocins and gene transfer agents. (i) Defective prophages (sometimes called ‘cryptic prophages’, although in theory this term could include fully functional prophages that have never been induced to lytic growth) are prophages that are in a state of mutational decay. Although they may still harbour functional genes, defective prophages are unable to programme the full phage replication cycle (reviewed by Campbell, 1994; 1996). Several defective prophages in E. coli K-12, Rac (Kaiser and Murray, 1979), e14 (Greener and Hill, 1980), DLP12 (Lindsey et al., 1989) and QIN (Espion et al., 1983) (Table 1), and in Bacillus subtilis 168, PBSX (Krogh et al., 1996) and SKIN (Takemaru et al., 1995; Mizuno et al., 1996), were discovered before genomic sequencing became possible and have been studied in some detail. Each of these harbours some functional genes. For example, Rac encodes the RecE homologous recombination system (Kaiser and Murray, 1979), QIN harbours intact cell lysis genes (Espion et al., 1983), and PBSX encodes the synthesis of a virion-like particle (Okamato et al., 1968). (ii) Satellite phages are otherwise functional phages that do not carry their own virion structural protein genes, and have chromosomes that have been evolutionarily designed to be encapsidated by the virion proteins of other specific phages. The best understood example of such a parasitic relationship is that between satellite phage P4 and fully functional phage P2 (see also Ruzin et al., 2001).
P4 carries genes that encode proteins that replicate its own DNA, turn on the virion protein genes of the P2 prophage and modify the P2 head so that it is smaller and can accommodate only the smaller P4 chromosome (Bertani and Six, 1988). (iii) Some bacteria produce bacteriocins (devices that kill other bacteria) that resemble phage tails (e.g. Gratia, 1989; Thaler et al., 1995; Zink et al., 1995; Nguyen et al., 1999; Nakayama et al., 2000). Two of these that have been characterized, the type F and R bacteriocins of Pseudomonas aeruginosa PAO1, are similar to phage λ tails and phage P2 tails respectively (Nakayama et al., 2000). The gene clusters encoding them have nearly complete sets of λ and P2 tail gene homologues in nearly the same order as they are found in those phages. (iv) Finally, gene transfer agents (GTAs) are encoded by some bacterial genomes (Yen et al., 1979; Starich et al., 1985; Rapp and Wall, 1987; Humphrey et al., 1997). GTAs are tailed phage-like particles that encapsidate random fragments of the bacterial genome. These particles cannot propagate as viruses, as the vast majority of the particles do not carry the genes that encode the GTA and, in the cases that have been studied, those that do contain a DNA fragment that is too short to include the full set of GTA genes. These virion-like particles can deliver their DNA payload into another bacterium of the same species, where the DNA can replace the resident cognate chromosomal region by homologous recombination. The best characterized GTA is encoded by a cluster of genes on the Rhodobacter capsulatus chromosome (Lang et al., 2000; Lang and Beatty, 2001). Although not all the proteins encoded by the genes in this GTA cluster have been characterized in detail, the number of genes involved makes it likely that it will contain the genes for the structural components of the virion-like particles and little else.
Do the tail-like bacteriocins and GTAs have a positively selected function or are they simply defective prophages that happen by chance to be able to perform these functions that serve no important purpose for the host? There are several arguments for such a selected function. (i) They are often universally present in species that harbour them. The Brachyspira hyodysenteriae GTA has been found in every isolate of that species that has been examined (T. Stanton and G. Thompson, personal communication), as has the R. capsulatus GTA (Wall et al., 1975), and the F and R bacteriocins were present in all of the nine P. aeruginosa strains examined (Nakayama et al., 2000). (ii) They do not appear to be in a state of evolutionary decay as pseudogenes (used here to mean any mutationally inactivated gene) have not been identified within them; and (iii) expression of their genes appears to be regulated differently from the phages to which they are related (Nakayama et al., 2000). In spite of this accumulated knowledge, it is often not possible to distinguish among functional prophages and these prophage-like entities by simply examining their nucleotide sequences. For example, a tail gene cluster in a bacterial chromosome could encode a bacteriocin or simply be what remains of a partly deleted prophage. Induced PBSX encapsulates host DNA, and its virion-like particles kill B. subtilis cells that do not carry PBSX (McDonnell et al., 1994) (it has not been demonstrated to be able to transduce other bacteria with its packaged DNA, but this remains a possibility). Is PBSX a GTA, a bacteriocin or a decaying prophage? 
Because of such current unknowables, in this discussion I will usually not attempt to distinguish fully functional prophages from defective prophages, satellite prophages, GTAs or phage-like bacteriocins and will include them all within the term ‘prophage.’ I will only consider the temperate dsDNA-tailed phages of bacteria, although temperate phages with ssDNA containing filamentous virions are known that integrate as dsDNA prophages (Waldor and Mekalanos, 1996; Chang et al., 1998; Davis et al., 1999; Lin et al., 2001; da Silva et al., 2002), and not yet well-studied lytic and temperate dsDNA tailed phages that infect Archaea are known (for example, see Pfister et al., 1998; Klein et al., 2002; Tang et al., 2002).

Table 1. Prophages in three E. coli genomes.

E. coli K-12 (a) | E. coli O157 EDL933 (a) | E. coli O157 Sakai (a) | Phage type
CP4-6 (b)        | CP-933I, CP-933H        | Sp1, Sp2               | Lambdoid, P4-like
-                | CP-933C                 | Sp7                    | Unstudied type
e14              | CP-933X (2?)            | Sp8                    | Lambdoid
-                | CP-933O (2–4)           | Sp9                    | Lambdoid
QIN              | CP-933P                 | Sp11, Sp12             | Lambdoid
-                | CP-933T                 | Sp13                   | Somewhat P2-like
PR-X             | -                       | -                      | P2-like, highly deleted
-                | CP-933V                 | Sp15                   | Lambdoid
CPS-53 (d)       | CP-22 (d)               | Sp16                   | P22-like, highly deleted
Eut (c)          | -                       | -                      | P22-like, highly deleted

a. Each row represents a different integration site. In some cases (e.g. QIN site), rearrangements have made it difficult to tell whether they have identical attachment sites. This list was compiled from the following publications and references therein: Blattner et al. (1997); Rudd (1999); Hayashi et al. (2001b); Ohnishi et al. (2001); Perna et al. (2001; 2002). Elements are listed in order clockwise around the standard E. coli map; see Table S1 in Supplementary material for a list of the genes thought to lie within each prophage. Only λ and 933W have been shown to be fully functional phage genomes. Duplicate morphogenesis functions suggest that CP-933X may be evolved from two original prophages. The correspondence between Sp9 and Sp11 + Sp12 and CP-933O and CP-933P, respectively, is complex because of an inversion in the EDL933 strain lineage that involved these prophages, and other rearrangements that have occurred among the prophages (Perna et al., 2002).

b. These elements are possibly phage derived, but do not carry any uniquely phage-derived genes. CP4-6, CP4-44 and CP4-57 of K-12 and SpLE2 of Sakai are probably phage derived, but convincing proof of this is lacking (see text); in Sakai, SpLE1 and SpLE4, not shown in this table, have some similarity to the CP4 elements (Blattner et al., 1997; Rudd, 1999; Hayashi et al., 2001b). The CP4 elements are not closely related to the prophages at the same location in the other strains.

c. Phage λ was cured from the sequenced version of E. coli K-12 (Blattner et al., 1997); Eut [also called CPZ-55 (Rudd, 1999) and ‘CP-unnamed’ (Hayashi et al., 2001b)] is missing from some extant K-12 laboratory strains (Kofoid et al., 1999); Rac and e14 are also excisable (Evans et al., 1979; Brody et al., 1985).

d. CP-22 is a provisional name for a region not formally identified as a prophage by Perna et al. (2001). CPS-53 (Rudd, 1999) has also been called KpLE1.

Prophage abundance

Should we expect prophages to be present in bacterial genome sequences and, if so, how many? In addition to the anecdotal observation that many of the phages currently under study were isolated after their release from lysogenic bacteria, more systematic studies have indicated that prophages can be very common. Osawa et al. (2000) found that 51 different functional phages were released from 27 E. coli strains, and Schicklmaier et al. (1998) found that 83 of 107 E. coli strains released at least one functional phage type. Schmieger and coworkers (Schicklmaier et al., 1998; Schmieger and Schicklmaier, 1999) examined 173 Salmonella enterica (serovar Typhimurium) isolates and found that 136 released functional phages. Indeed, the LT2 isolate of S. enterica that is commonly used in laboratory studies carries four intact, fully functional prophages (Yamamoto, 1967; 1969; Figueroa-Bossi and Bossi, 1999; McClelland et al., 2001). Mitomycin C was found to induce the synthesis of functional phages from seven of 170 Yersinia strains (Popp et al., 2000) and phages or phage-like particles from 38 of 68 Gram-positive dairy Streptococcus strains (Huggins and Sandine, 1977). Of course, all such searches find a minimum number of functional prophages, as they depend upon successful induction and use of permissive indicator strains.

Other studies have asked about the presence of particular prophage features in multiple isolates of the same bacterial species. In the E. coli chromosome, the attachment site of the λ-like (lambdoid) phage 21 is occupied by phage-like sequences in 28 of 77 strains examined (Wang et al., 1997), the lambdoid phage Atlas attachment site is occupied in 23 of 72 strains examined (Milkman and Bridges, 1990; Sandt and Hill, 2000), and four of 33 strains examined have something (probably λ-like in two cases) inserted at the phage λ attachment site (Kuhn and Campbell, 2001). Hybridization of DNA from various bacterial strains with authentic phage or prophage DNA probes has shown that related prophages are often present in a substantial fraction of other isolates of the same species [a few of the many such analyses are as follows: Gram-negative enterobacteria (Anilionis et al., 1980; Lindsey et al., 1989; Faubladier and Bouche, 1994; Agron et al., 2001), Wolbachia (Masui et al., 2000) and Haemophilus (Chang et al., 2000); spirochaete Borrelias (Casjens et al., 1997); Gram-positive Streptococcus (Ramirez et al., 1999; Beres et al., 2002; Smoot et al., 2002) and diphtheria-causing Corynebacterium (Pappenheimer and Murphy, 1983)]. Finally, a substantial fraction of searches for strain-specific bacterial sequences for use in the typing of related bacterial isolates have found prophage sequences [e.g. enterobacteria (Emmerth et al., 1999; McClelland et al., 2000), Campylobacter (Dep et al., 2001), Neisseria (Klee et al., 2000), and Lactobacillus (Brandt et al., 2001)]. Clearly, prophages are common in many, widely diverse bacterial species.

A plethora of putative prophages in bacterial genome sequences

In spite of this anecdotal evidence that prophages can be common, their abundance in bacterial genome sequences came as a bit of a surprise to many microbiologists. In the 14 published γ-Proteobacteria genomes, the bacterial phylum with phages that are the best studied and in which prophages are therefore most easily recognized, the number of convincing prophages is high. Eleven of these genomes, those of S. enterica serovars Typhi and Typhimurium, two Yersinia pestis strains, Shigella flexneri, two Xylella fastidiosa strains and four E. coli strains each carry between seven and 20 prophages (Blattner et al., 1997; Simpson et al., 2000; Hayashi et al., 2001a; McClelland et al., 2001; Parkhill et al., 2001a,b; Perna et al., 2001; Deng et al., 2002; Jin et al., 2002; Welch et al., 2002; Van Sluys et al., 2003), and the Shewanella oneidensis, Xanthomonas axonopodis and Xanthomonas campestris genomes contain three, two and one recognized prophages respectively (Heidelberg et al., 2002; da Silva et al., 2002). Bacteria from other phyla also often harbour multiple prophages. For example, among the Gram-positive bacteria, the sequenced genomes of B. subtilis, Clostridium acetobutylicum, Clostridium perfringens, Clostridium tetani, Lactococcus lactis, Listeria innocua, Listeria monocytogenes, Staphylococcus aureus and Streptococcus pyogenes strains all carry multiple, easily recognizable and, in many cases, largely intact prophages (Kunst et al., 1997; Bolotin et al., 2001; Ferretti et al., 2001; Glaser et al., 2001; Kuroda et al., 2001; Nolling et al., 2001; Beres et al., 2002; Shimizu et al., 2002; Smoot et al., 2002; Bruggemann et al., 2003). The phages that infect B. subtilis and L. lactis are the best studied in this rather diverse group. B. subtilis 168 contains three very convincing and largely intact prophages plus at least two smaller possible prophage remnants.
Of its three unambiguous prophages, one, SPβ, is a fully functional 134 kbp phage genome (Lazarevic et al., 1999; it is the largest known temperate phage), whereas the other two, PBSX and SKIN, are defective (Krogh et al., 1996; Mizuno et al., 1996). At least two of the six L. lactis IL1403 prophages are fully functional (Chopin et al., 1989; 2001). Prophages can make up a significant fraction of these genomes; E. coli O157 Sakai's 18 recognized prophages make up about 12% of its chromosome (Ohnishi et al., 2001), and the six prophages in Streptococcus pyogenes M3 MGAS 315 make up about 12% of its chromosome (Beres et al., 2002). Phages of other phyla have been studied in less detail, but the spirochaete Borrelia burgdorferi B31's multiple plasmid prophages may constitute as much as 20% of its genome (Casjens et al., 2000). I emphasize that, although it is clear that many prophages are present in bacterial genomes, our current knowledge is far from complete, and some of the interpretations made here may have to be revised in the future. Although prophages are common in bacterial genomes, they have not been found in every individual or in every species. Among the 82 currently published and annotated bacterial genome sequences, 51 harbour apparent prophages and, of these, all but two have integrated prophages. At least 230 prophages are currently recognizable in these 51 genomes. These prophages are listed, along with the genes that they encompass in Table S1 in the Supplementary material. As even the most conserved phage-specific genes (below) are not always recognizable with current methods or might have been deleted, this is a minimum estimate, especially in bacterial phyla in which phages have not been studied in detail. The 31 bacterial genome sequences that contain no recognized prophages are largely clustered at the lower end of the bacterial genome size range (Fig. 1). Two of the smallest genomes that have prophages, B. 
burgdorferi B31 and Chlamydia pneumoniae AR39, are ‘exceptions that prove the rule’, in that the prophages they harbour are plasmids (Casjens et al., 2000; Read et al., 2000). The absence of integrated prophages in small-genome bacteria could reflect the evolutionary pressure to remove non-essential chromosomal DNA that led to the reduction in the size of their genomes (Lawrence et al., 2001). A few of the larger bacterial genome sequences, for example those of the high G+C Gram-positive bacteria such as Mycobacterium (4.4 mbp) and Streptomyces (9.07 mbp) have relatively few convincing prophages (Fleischmann et al., 2002; Bentley et al., 2002). In addition, P. aeruginosa PAO1 (6.3 mbp) carries only two tail-like bacteriocins, and Sinorhizobium meliloti 1021 (6.7 mbp) has no recognized prophages (Stover et al., 2000; Galibert et al., 2001). In some cases, temperate phages that infect these bacteria are known, making it less likely (but not impossible) that prophages are present in the genomes but remain unrecognized. For example, temperate phage φC31 of Streptomyces has been characterized (Smith et al., 1999), and P. aeruginosa phages are known that are similar to the well-studied E. coli phages λ and P2 [e.g. phages D3 (Kropinski, 2000) and φCTX (Nakayama et al., 1999) respectively]. Perhaps some bacteria have devised mechanisms to avoid such parasites or, by chance, individuals with no integrated phage genomes were chosen for sequencing. It should also be noted that, if laboratory bacterial growth conditions cause frequent induction of a resident prophage, this will impose an artificial selection for derivatives that have lost the prophage. This has apparently happened for the prophages Gifsy-1 and Gifsy-2 in some laboratory strains of S. enterica LT2 (Bunny et al., 2002).
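The survey counts quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers given in the text (82 genomes, 51 with prophages, 230 recognizable prophages); the variable names are illustrative, and the per-genome mean is a derived figure that the text itself does not state.

```python
# Counts taken directly from the text; variable names are illustrative only.
total_genomes = 82
genomes_with_prophages = 51
recognized_prophages = 230

# Genomes with no recognized prophage (the text gives 31).
genomes_without = total_genomes - genomes_with_prophages

# Share of sequenced genomes carrying at least one recognizable prophage.
percent_with = 100.0 * genomes_with_prophages / total_genomes

# Derived figure, not stated in the text: mean prophages per prophage-bearing genome.
mean_per_lysogen = recognized_prophages / genomes_with_prophages

print(f"{genomes_without} genomes without recognized prophages")
print(f"{percent_with:.1f}% of genomes carry prophages")
print(f"{mean_per_lysogen:.1f} prophages per prophage-bearing genome on average")
```

Because unrecognized prophages are not counted, these values are lower bounds in the same sense as the counts they are derived from.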

Figure 1.

Putative prophages in sequenced bacterial genomes. The number of recognizable prophages in each of the 82 published bacterial genome sequences is indicated. Closed circles represent genomes with only integrated prophages, and open circles indicate genomes with prophage plasmids (Borrelia burgdorferi B31, 12 prophages; Chlamydia pneumoniae AR39, one prophage). These probably represent minimum prophage numbers, as some may not be currently recognizable. The individual prophages in each genome sequence are delineated in Table S1 (Supplementary material).

The genetic structure of prophages

As nearly 100 complete sequences of fully functional dsDNA tailed phage genomes have been determined, it might seem to be a trivial exercise to search for homologues of known phage genes in bacterial genome sequences and thus identify prophages; however, there are confounding factors. The most important of these factors is the extreme diversity of the dsDNA tailed phages (e.g. Casjens et al., 1992; Hendrix et al., 1999). The phages that infect the enteric bacteria E. coli and Salmonella are the most intensively studied. Yet even today, the sequence of a ‘new’ phage that is closely related to these well-characterized phages is expected to contain novel genes. For example, our recently determined sequence of the genome of phage ES18, a typical lambdoid phage that infects S. enterica (serovar Typhimurium), has about 20 novel genes out of 75 total predicted genes (M. Pedulla, R. Hendrix, G. Hatfull and S. Casjens, unpublished). Prophages in less well-studied bacterial phyla can be expected to contain a majority of novel genes (e.g. 40 of 52 predicted genes in the convincing prophage RadMu in the Deinococcus radiodurans R1 genome have no known homologue; Morgan et al., 2002).

The genomes of most phages that are closely related to one another can be described as having a mosaic relationship, as comparison of any two individuals shows patches of (sometimes very high) sequence similarity separated by non-homologous regions. The notion that such mosaicism has arisen by horizontal transfer of genetic material among the tailed phages has been discussed extensively (Susskind and Botstein, 1978; Botstein, 1980; Campbell and Botstein, 1983; Casjens et al., 1992; Campbell, 1994; 1996; Hendrix et al., 1999; Lucchini et al., 1999; Juhala et al., 2000; Moreira, 2000; Desiere et al., 2001; Brussow and Hendrix, 2002; Lawrence et al., 2002). Such mosaicism is strikingly demonstrated by the relationships among the well-studied phages λ, P22 and N15, all of which have historically been included in the lambdoid phage group. Figure  2 shows that P22 and λ have similar but mosaically related right halves (early regions) but very different left halves (late operon/virion protein genes), whereas N15 and λ have very similar left halves and little similarity in their right halves. A curious result of this is that P22 and N15 are both considered to be lambdoid phages, but they are almost completely non-homologous and only distantly related in their few homologous genes (Ravin et al., 2000). The genetic diversity of phages has only been studied among those that infect the Gram-negative γ-Proteobacteria and the Gram-positive Firmicutes, and these are both far from attaining ‘saturation’. Nonetheless, comparison of phages with very similar transcriptional programmes that infect γ-Proteobacteria, such as the lambdoid phages of E. coli, phages P22, Gifsy-1, Gifsy-2, Fels1 and ES18 of S. enterica (McClelland et al., 2000; Pedulla et al., 2003; S. Casjens, R. Hendrix and M. Pedulla, unpublished), Sf6 and SfV of S. flexneri (Allison et al., 2002; S. Casjens, A. J. Clark, W. Inwood and R. Moreno, unpublished), prophages XfP1 and XfP2 of X. 
fastidiosa (Simpson et al., 2000), prophage λSo of Shewanella oneidensis (Heidelberg et al., 2002) and phage D3 of P. aeruginosa (Kropinski, 2000) suggest that exchanges among them have taken place such that quite similar genes can be present even in distantly related phages within this group. There have also been recent exchanges of genetic material between very different phages that infect the same host. For example, the E. coli temperate phage λ and large lytic phage T4 have tail fibre assembly genes that are similar in sequence and functionally interchangeable (George et al., 1983; Montag and Henning, 1987). Although genes can be exchanged among distantly related phages with the same host and among phages with different host species, two phages of the same type are more likely (but not guaranteed) to have a higher proportion of more closely related genes if they infect closely related hosts. The lessons for this discussion are that (i) horizontal exchanges are common among the dsDNA tailed phages, so it will not be surprising to find similar mosaic relationships among prophages that are found in bacterial genome sequences; and (ii) prophages in the chromosomes of bacteria that are distantly related to the above two phyla may be very different from known phages and so be much more difficult to recognize.
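The idea of a mosaic relationship, patches of similarity separated by non-homologous regions, can be illustrated with a toy comparison. The sketch below is an assumption-laden simplification: it uses made-up nucleotide strings and exact k-mer matching, whereas real phage comparisons rely on alignment tools and are best done at the protein level. The function name `shared_patches` and the sequences are hypothetical.

```python
def shared_patches(a, b, k=8):
    """Return (start, end) intervals of `a` covered by k-mers that also occur in `b`."""
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    patches = []
    for i in range(len(a) - k + 1):
        if a[i:i + k] in kmers_b:
            start, end = i, i + k
            # Extend the previous patch when this k-mer overlaps or abuts it.
            if patches and start <= patches[-1][1]:
                patches[-1] = (patches[-1][0], end)
            else:
                patches.append((start, end))
    return patches


# Two invented "genomes" that share two blocks embedded in unrelated sequence.
block_1 = "ACGTACGGTCAT"
block_2 = "TTGACCGGATAC"
phage_a = "GGGGGGGGGG" + block_1 + "CCCCCCCCCC" + block_2
phage_b = block_1 + "TATATATATA" + block_2 + "AAAAAAAAAA"

print(shared_patches(phage_a, phage_b))
```

The two intervals reported for `phage_a` correspond to the shared blocks; everything between them is the kind of non-homologous region that separates the similar patches in a mosaic pair.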

Figure 2.

Temperate phage genome mosaicism – three ‘unrelated’ lambdoid phages. The genes on phage P22, λ and N15 virion chromosomes are shown with rectangles representing genes; grey rectangles are genes that are transcribed rightward and white are transcribed leftward. The three lytic operons are indicated by arrows below each genome. The ends of each phage's circularly permuted prophage are marked by a black vertical line. Sequence homology is indicated by the light grey areas between genomes.

Recognizing prophages in bacterial genome nucleotide sequence

Some, but not all, phage genome sequences per se have unique properties. For example, some prophages have different G+C contents, oligonucleotide frequencies or codon usage from their host's genome, but this type of analysis has not progressed to the point that it can unequivocally identify prophage sequences (Blaisdell et al., 1996). We must therefore identify prophages in bacterial genome sequences by the similarity of their genes to known phage genes. In spite of the fact that the dsDNA tailed phage genomes encompass an enormous amount of sequence diversity, there are genes that appear to be more highly conserved than others (below). These have served, and will continue to serve, as ‘cornerstones’ for the identification of prophages in bacterial genomes (the range of diversity makes it imperative that sequence searches be done at the encoded protein level, and not at the DNA level). It would be useful if the phage gene families used to identify new prophages in DNA sequence did not have non-phage-encoded members that perform non-phage functions, so that the mere presence of such cornerstone genes could prove that a region of a bacterial genome is phage derived.
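As an illustration of the compositional analysis mentioned above, the following sketch flags windows whose G+C content departs from the genome-wide value. It is a minimal example under stated assumptions: the window size, step and threshold are arbitrary choices, the function names are hypothetical, and, as the text notes, such a signal can only suggest a candidate region and cannot by itself identify a prophage.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a nucleotide string (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def atypical_windows(genome, window=1000, step=500, threshold=0.10):
    """Return (start, end, gc) for windows whose G+C differs from the
    whole-sequence G+C by at least `threshold` (an arbitrary cut-off)."""
    overall = gc_fraction(genome)
    hits = []
    for start in range(0, len(genome) - window + 1, step):
        gc = gc_fraction(genome[start:start + window])
        if abs(gc - overall) >= threshold:
            hits.append((start, start + window, round(gc, 3)))
    return hits


# Toy sequence: 50% G+C "host" DNA with a 100 bp A+T-only insert in the middle.
toy_genome = "GCAT" * 100 + "AATA" * 25 + "GCAT" * 100
print(atypical_windows(toy_genome, window=100, step=100, threshold=0.10))
```

Only the window covering the insert is reported for the toy sequence; in a real genome, flagged windows would still need to be examined for phage gene homologues.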

Genes in prophages that do not encode virion components

Should phage genes such as those involved in integration, lysis, regulation of gene expression or DNA replication be considered prophage cornerstone genes? Integrases are usually sufficiently conserved to be recognizable, but plasmid prophages do not integrate, and non-phage elements such as plasmids, pathogenicity islands and integrons can carry integrase genes for their own purposes. Thus, although most temperate phages carry an integrase gene, its presence is neither necessary nor sufficient to prove the existence of a prophage. Phage lysis enzymes are often true homologues of chicken egg white lysozyme but may be of other types, such as phage λ endolysin or phage amidases, or may have similarity to other polysaccharide-degrading enzymes such as chitinases (Mediavilla et al., 2000). These proteins can be quite similar, even among distantly related phages, but some bacteria encode ‘autolysins’ that are homologues of phage lysis enzymes. Autolysin genes often appear not to be in a prophage context (e.g. Whatmore and Dowson, 1999; Smith et al., 2000), and such enzymes might be used in normal bacterial cell wall remodelling. It is unknown whether these are ancient prophage relics that have now become useful parts of the bacterial genomes. Every host and many temperate phages encode their own DNA-binding proteins, nucleases, helicases and/or DNA polymerases that function in DNA metabolism and regulatory proteins that control gene expression. The existence of non-prophage bacterial homologues to nearly all these genes shows that they also do not uniquely mark prophages (e.g. Lewis et al., 1998). No host homologue of the transcriptional antiterminators of the λ gene Q family is known, so these might mark some prophages.

Families of homologous phage genes involved in the above processes may or may not form discrete phylogenetic clusters that are separable from their bacterial homologues; however, a very close relationship to a bona fide phage gene is likely to signify that a gene in question is part of a prophage. We will consider two examples, the phage-borne replicon-partitioning proteins and the single-strand DNA-binding proteins (SSBs). The sopA family of plasmid-partitioning genes on the prophage plasmids of E. coli phages N15 and P1 are not particularly close relatives; the N15 SopA protein is 60–75% identical to SopAs encoded by several non-prophage plasmids of enteric bacteria but is only 25% identical to its phage P1 homologue. On the other hand, the S. flexneri lambdoid phage Sf6 SSB protein (S. Casjens, unpublished) is a very close relative (93% identity) of the E. coli phage 1639 SSB (GenBank accession no. AJ304858), but is only moderately closely related to SSBs of E. coli phage P1 (60%) and the non-phage SSBs of enterobacteria (58–62%); it is only distantly related to SSBs of Gram-positive bacteria (22–30%) and their phages (A118, 29%; ΦPVL, 32%). Thus, when members of the same gene family are used in both phage and non-phage contexts, the phage and bacterial genes often do not fall into well-separated lineages. On account of these issues, and variation in DNA metabolism, gene regulation and lysis mechanisms, etc. among phages, the presence of genes for these processes should be considered as supportive but not sufficient evidence for absolute proof of the existence of a prophage.

Virion protein genes as prophage indicator cornerstones

On the other hand, one might expect the genes that encode proteins involved in building the virion to be unique to phages, as bacterial cells are not known to make similar structures for their own purposes (again here I include GTAs and tail-like bacteriocins as ‘prophages’), and this is indeed the case; phage morphogenetic genes usually do not have homologues that are known to perform unrelated functions in other contexts. Therefore, the presence of genes that are closely related to known phage morphogenetic genes in a bacterial genome is, at our current state of knowledge, a virtually unassailable indication of a prophage.

The icosahedral heads of the different tailed phages are extremely similar in physical appearance, although they do have different sizes and some are elongated. Similarly, tails are only known in three general morphotypes – short (e.g. phages P22 and T7), long, contractile (P2 and Mu) and long, non-contractile (λ), although details of tail structure sometimes allow recognition of subtypes within these general tail types (for reviews of phage virion structure and assembly, see Casjens and Hendrix, 1988; Casjens, 1997). However, the proteins that build the various structurally similar virions are at first glance startlingly diverse. For example, scaffolding protein (required catalytically for head shell assembly), proteins at the head–tail junction and proteins at the tail tip/baseplate are very often not recognizably similar among different phages. Even central virion assembly players such as the coat proteins (building block of the icosahedral head shell) are often not recognizably similar. For example, the coat proteins of the very well-studied enterobacterial phages λ, P2, P22, HK97, Mu and T7 are not recognizably homologous even though their heads are virtually indistinguishable in appearance in the electron microscope. It is not known whether such diversity indicates that these are all truly unrelated proteins or whether these proteins are ancient homologues that have diverged to the point of having no recognizable amino acid sequence similarity. The recent observation that HK97 and P22 coat proteins have similar folds supports the latter idea for these two coat proteins (Jian et al., 2003).

Nonetheless, some phage virion assembly proteins are more highly conserved than others, and homology of these genes can often be recognized between phage types. These are as follows: (i) the larger of the two subunits of terminase, the enzyme that cleaves virion-length molecules from concatemeric replicating DNA and is probably part of the motor that drives DNA into the preformed protein capsid; (ii) portal protein, which forms the hole through which DNA is packaged into the capsid and is also part of the packaging motor; (iii) head maturation protease – the assembly of some but not all phage heads is accompanied by assembly-controlled proteolytic cleavage of virion proteins; (iv) coat protein (above); (v) the proteins that build the tail shaft; (vi) tail tapemeasure protein, which determines the length of the tail shaft in the long-tailed phages; and (vii) tail fibres – tail tip proteins that make the initial contact between the virion and bacterial surface. Although the above proteins appear to be more highly conserved than other virion assembly proteins, in no case have all known members of one of these functional protein types been shown to form a single protein sequence family. It is possible that some or all of these may coalesce into single groups as more phage genome sequences are determined.

How confident can we be that weak or tenuous matches to virion assembly genes identify a prophage? The tail fibre proteins and tapemeasure proteins adopt extended, fibrous conformations, and they often contain imperfect amino acid sequence repeats that reflect these structures. These repeats are sometimes found to match other ‘unrelated’ extended proteins such as myosin, collagen, etc., as well as long coiled-coil proteins. For example, some phage tail fibres contain substantial numbers of the collagen Gly-X-Y repeat (Smith et al., 1998). In addition, the sequences of coat proteins, tail shaft proteins and the head maturation proteases are somewhat more variable than the other proteins in this ‘conserved protein’ list. Protease motifs can often be recognized in the latter, but such motifs are not phage specific. For all three of these protein types, similarity is sometimes found between distantly related phages, yet it is not uncommon to find no substantive similarity between otherwise rather close relatives. Probably the most universally conserved and therefore best cornerstone proteins for prophage identification are the large terminase subunit and portal protein. If PSI-BLAST (Altschul et al., 1997) is used to build up related families of terminase and portal homologues from the current sequence database, a small number of currently unconnected families accumulate in both cases, and no convincing matches to these proteins are found that have a known non-phage function. Yet there are a few ‘orphan’ homologues of terminase and portal genes present in bacterial genomes that have no other unequivocal phage genes nearby. For example, the Sinorhizobium meliloti 1021 genome contains an isolated, excellent homologue (gene SMc04187) of the phage P22 large terminase subunit (Galibert et al., 2001), and an orphan portal homologue (gene Spy0555) is present in the Streptococcus pyogenes M1SF370 genome (Ferretti et al., 2001). 
The functions of these particular genes have not been studied. Are these all that remains of once functional prophages, or might they have other, as yet unknown, non-phage-related roles in these cases? At present, we do not know the answer to this question, but current information suggests that such a lone homologue may well be a relict prophage.

Subjective prophage criteria

Given the immense variation among phages and our incomplete knowledge of that variation, recognition of prophages can be a rather subjective and delicate art, especially as satellite prophages and partly deleted defective prophages may contain no morphogenetic ‘cornerstone’ genes. However, there are less objective criteria that can contribute substantially to our confidence in prophage identification. In spite of their diversity, the temperate phages appear to have settled on a limited number of transcriptional arrangements, and they tend to have operons that are longer than the average E. coli operon, presumably to allow turn-off of the lytic genes by repression at a small number of operators. The latter can be confidence building for prophage identification in bacteria such as E. coli, which have more or less randomly oriented genes, but is less useful in genomes with genes that are largely oriented in the direction of DNA replication such as Clostridium (Shimizu et al., 2002) and Thermoanaerobacter (Bao et al., 2002).

More importantly, phage genomes show striking gene clustering according to general function and ordering according to detailed function within some of the clusters, and genes that encode DNA-interacting proteins usually lie near the DNA target of those proteins. For example, prophage integrase genes are essentially always adjacent to or very near the attachment (integration) site on the phage chromosome, and so they typically mark one end of integrated prophages. Of particular interest here is the observation that, within the gene cluster that encodes the virion assembly proteins, there exists a striking conservation of gene order (Casjens and Hendrix, 1988; Casjens et al., 1992; Hendrix and Duda, 1998). Recombination, replication and control functions are not found in this cluster, although a small number of non-assembly genes appear to have been relatively recently inserted into this operon in some temperate phages (Hendrix et al., 2000). In nearly every tailed phage and prophage with a gene order that is known, the order is ‘terminase – portal – protease – scaffold – major head shell (coat) protein – head/tail-joining proteins – tail shaft protein – tapemeasure protein – tail tip/baseplate proteins – tail fibre’ (listed in the order of transcription). The large lytic phages such as those typified by T4 often have some rearrangements relative to this order, but the order is especially well conserved in the temperate phages. This is shown for the most highly conserved genes in some of the best-characterized phages in Fig. 3. Fifteen to 25 proteins are typically used to build a temperate tailed phage's virion, so the more highly conserved proteins are typically embedded in this order in an apparent operon of this size. The lysis genes usually lie in the same orientation, adjacent to and at either end of the virion protein cluster.
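
The gene order described above lends itself to a simple consistency check. The following is a minimal sketch, assuming that each gene in a candidate region has already been given a functional label (or left unlabelled) and that the genes are listed in the order of transcription. The label strings and function names are hypothetical. Missing functions are tolerated, since some phages lack certain genes, and only the relative order of the recognized genes is tested.

```python
# Canonical order of the conserved morphogenetic genes, as listed in the text
# (in the order of transcription). The label spellings are illustrative.
CANONICAL_ORDER = [
    "terminase", "portal", "protease", "scaffold", "coat",
    "head-tail joining", "tail shaft", "tapemeasure",
    "tail tip/baseplate", "tail fibre",
]
RANK = {name: i for i, name in enumerate(CANONICAL_ORDER)}


def recognized_ranks(gene_functions):
    """Canonical ranks of the genes whose function label is recognized."""
    return [RANK[f] for f in gene_functions if f in RANK]


def follows_canonical_order(gene_functions):
    """True if the recognized genes appear in non-decreasing canonical rank."""
    ranks = recognized_ranks(gene_functions)
    return all(a <= b for a, b in zip(ranks, ranks[1:]))


# Hypothetical short-tailed arrangement: no tail shaft or tapemeasure genes,
# and one gene of unknown function within the cluster.
short_tailed = ["terminase", "portal", "scaffold", "coat", "unknown", "tail fibre"]

# Hypothetical arrangement in which two of the conserved genes are swapped.
swapped = ["portal", "terminase", "scaffold", "coat"]

print(follows_canonical_order(short_tailed))
print(follows_canonical_order(swapped))
```

A check of this kind cannot replace judgment; it only makes explicit the expectation that weak matches gain credibility when they fall in the expected relative positions.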

Figure 3.

Conserved genes and gene order in temperate phage morphogenetic operons. The most highly conserved genes in the morphogenetic (late) operons of temperate phages are shown as coloured rectangles; rectangle colours indicate similar functions as labelled. Identical colours do not necessarily indicate sequence similarity; phages are sufficiently diverse that not all proteins of similar function are recognizably homologous (see text). Black circles indicate the location of packaging initiation sites where this is known. A gap between rectangles indicates that there is a gene(s) between them that is not shown in the figure. The black arrow above indicates the direction of transcription for all the genes in the figure except two phage P2 genes, which are indicated to be transcribed in the opposite direction. The functions of most of the indicated E. coli and S. enterica phage genes have been determined directly, whereas the functions of most of the genes of the other phages shown in the figure have been deduced by sequence homology.

This is biology, so there are of course exceptions to any rules we might attempt to derive. Some temperate phages such as P22 have short tails and so have no tapemeasure or tail shaft proteins, and the well-studied E. coli phage P2 and its close relatives have inverted terminase and portal genes relative to other phages, and their lysis genes lie between tail genes. But, overall, the above conserved morphogenetic gene order has relatively few exceptions and, when weak matches are present in this order, credence can be lent to otherwise uncertain similarities. An instructive case in point is the family of 30–32 kbp circular ‘cp32’ plasmids found in the spirochaete B. burgdorferi. Each of these plasmids carries a similar, very poorly expressed 22-gene-long putative operon, which at the time of sequencing contained only novel genes (Fraser et al., 1997; Casjens et al., 2000; Ojaimi et al., 2003). As the phage sequence database grew, a moderately weak match (protein BLAST e-value = 3 × 10⁻⁸) was found between the second gene from the beginning of these Borrelia operons and a Streptococcus phage φO1205 gene (Stanley et al., 1997). This φO1205 gene, which is located near the promoter-proximal end of the putative morphogenetic operon (the expected position for a terminase gene), is a moderately weak match (e = 5.5 × 10⁻⁷) to the well-characterized terminase of B. subtilis phage SPP1. [The transitive nature of such sequence families (A matches B, B matches C, but A does not readily match C) is often a feature of relationships between distantly related phage virion proteins, and transitive matches should be accepted in such searches (see Gerstein, 1998).] Later, when the X. fastidiosa genome was sequenced (Simpson et al., 2000), the protein encoded by the adjacent, transcriptionally downstream Borrelia cp32 gene was found to match very weakly (e = 0.13) a protein encoded at the portal position (immediately downstream of the putative large terminase gene) in X. fastidiosa's convincing prophages XfP3 and XfP4. After two additional rounds of PSI-BLAST alignment, a family of proteins accumulates that includes the putative Borrelia portal proteins (now e = 3 × 10⁻⁷⁷) and proteins encoded at the portal position by very unambiguous prophages in S. enterica, Haemophilus influenzae and L. innocua, but no connection to experimentally proven portal proteins is made. In addition, a novel gene near the 3′ end of this Borrelia gene cluster was found to be able to replace a phage λ lysis (holin) gene functionally (Damman et al., 2000). None of these observations alone constitutes a very convincing argument that these Borrelia plasmids are or harbour prophages, but the fact that each of these three matches is at the expected location within a phage late operon (see Fig. 3) makes the argument considerably stronger. Finally, Eggers and Samuels (1999) found that cp32 plasmid DNA is present in tailed phage-like particles released from Borrelia, considerably strengthening the argument that these plasmids are indeed prophages (even though 90% of the genes in these putative virion assembly operons have no recognized homologues, and none has been studied in more detail). Although it is impossible to quantify the increase in confidence one obtains when such weak matches occur in the relative positions expected for a phage genome, anecdotal observations like this suggest that increased confidence is nonetheless at least partly justified and can certainly provide impetus for further directed experimental studies.
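
The transitive grouping described in this example (A matches B, B matches C, but A does not readily match C) can be sketched as single-linkage clustering of pairwise matches. This is a minimal illustration, assuming a list of (protein, protein, e-value) triples taken from some prior search. The `max_evalue` cut-off, the protein labels and the fourth 'unrelated' entry are hypothetical; only the two e-values for the terminase chain come from the text.

```python
def build_families(matches, max_evalue):
    """Group proteins into families by linking every pair whose e-value is
    at or below max_evalue; membership is transitive (single linkage)."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, evalue in matches:
        root_a, root_b = find(a), find(b)  # registers both proteins
        if evalue <= max_evalue:
            parent[root_a] = root_b

    families = {}
    for protein in list(parent):
        families.setdefault(find(protein), set()).add(protein)
    return list(families.values())


# Pairwise matches: the first two e-values are those quoted in the text;
# the last pair is an invented non-match added for contrast.
matches = [
    ("Borrelia cp32 gene 2", "phiO1205 gene", 3e-8),
    ("phiO1205 gene", "SPP1 terminase", 5.5e-7),
    ("SPP1 terminase", "unrelated protein", 2.0),
]
families = build_families(matches, max_evalue=1e-5)
print(families)
```

Single linkage is deliberately permissive: one spurious match can merge two unrelated families, which is one reason the positional evidence discussed above remains important.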

Highly deleted defective prophages

The evolutionary history of strain-specific elements that have no remaining virion assembly genes can be difficult to deduce, and it may never be possible to know unambiguously if they are in fact really prophage relics. Even in the E. coli K-12 genome, there are elements with origins that remain uncertain. For example, the 22-kbp-long CP4-57 element is inserted into the tmRNA gene, a site at which other more obvious prophages often lie in other bacteria (Table 1) (Kirby et al., 1994; Retallack et al., 1994). It contains an integrase, a functional homologue of the satellite phage P4 orf88 regulatory gene, no obviously non-phage genes and no recognizable homologues to virion protein genes. Similarly, the 34 kbp CP4-6 and 13 kbp CP4-44 elements in K-12 are possible prophages (Blattner et al., 1997; Rudd, 1999). CP4-6 carries an integrase gene at one end, several transposon parts, the arginine metabolism argF gene and a glycosyl hydrolase (the last two have been argued to have arrived in E. coli by relatively recent horizontal transfer; Van Vliet et al., 1988; Garcia-Vallve et al., 1999). Genes in these three regions have a similar codon usage that is different from E. coli (Perna et al., 2002), and these elements are not present in other E. coli strains. All three elements contain genes of unknown function that are homologous to one another and are similarly arranged. These CP4s have been called prophages without qualification in the literature, but their only overt phage homologies are integrase and control genes (Blattner et al., 1997; Garcia-Vallve et al., 1999; Rudd, 1999); genuine proof of phage ancestry awaits the discovery of a true phage with a genome structure that is similar to the CP4s. In the genomes of less well-studied bacteria, it is even more difficult to recognize partly deleted or satellite prophages that contain none of the prophage cornerstone genes.

Prophage evolution and genetic exchange between prophages

Prophages and the bacteria they inhabit have a somewhat precarious mutual existence. From the prophage perspective, many of its genes are not in use and so are not under selection for function. Therefore, mutations, including deleterious ones, can accumulate in these genes resulting in a defective prophage. The host bacterium is under threat of death by prophage induction, and it seems that, in the long term, it would be advantageous from the bacterium's perspective if the prophages were to suffer debilitating mutations, especially if those mutations blocked the ability of the prophage to express its potentially lethal genes (Lawrence et al., 2001). It is therefore not surprising that a large fraction of the prophages that have been identified in bacterial genome sequences appear to be defective (only nine of the more than 200 prophages in Table S1 have been shown experimentally to be fully functional phages). To begin to understand the evolutionary processes that work on prophage DNA, it is instructive to examine specific cases. Two cases will be considered here – the Rac prophages of E. coli (Table 1) and the Pnm prophages of Neisseria meningitidis. These are both Proteobacteria, and they may not be representative of all other bacterial phyla. For example, the sequenced Gram-positive Lactococcus, Lactobacillus and Streptococcus genomes contain multiple prophages, but very highly decayed prophages have not been identified there (nonetheless, possible defective prophages such as SF370.4 do exist in Streptococcus pyogenes; Canchaya et al., 2002). It is not yet clear whether this is a sampling difference or if some species might carry only relatively newly arrived prophages and/or have ways of avoiding the accumulation of defective prophages (see also above).

The Rac prophages

Figure 4 diagrammatically compares the prophage entities that lie at the Rac attachment site in the three sequenced E. coli chromosomes. Rac was the first defective prophage to be discovered in E. coli K-12 (Low, 1973; Kaiser and Murray, 1979). In this strain, it was shown that, although no Rac virions were ever produced upon induction, (i) parts of Rac can be picked up by the phage λ chromosome through homologous recombination (Zissler et al., 1971; Kaiser and Murray, 1979); (ii) the Rac prophage can be excised upon induction (Evans et al., 1979; Brikun et al., 1994); (iii) Rac is lethal to the host if expression of its genes is induced, and this lethality results from an inhibitor of host cell division that is homologous to the λ Kil protein (Feinstein and Low, 1982; Conter et al., 1996); (iv) mutations (called sbcA) in Rac can restore homologous recombination in recBrecC mutants by expressing the prophage's RecE function (Fouts et al., 1983; Willis et al., 1985); and (v) these sbcA mutants also express a function, Lar, that enhances EcoKI-mediated DNA methylation (similar to λ Ral function) (King and Murray, 1995). The sbcA mutations are thought to turn on the non-lethal part of the Rac prophage early left operon rather than altering RecE and Lar functions directly (Mahajan et al., 1990), indicating that the lar and recE genes are functional but unexpressed in the Rac prophage. The K-12 genome sequence confirmed that Rac is indeed a lambdoid prophage that has lost about 60% of its original DNA (Blattner et al., 1997). Its early left operon contains the recE gene at the position in which other lambdoid phages carry their genes for homologous recombination. More recently, the fully functional Salmonella phages, Gifsy-1 and Gifsy-2, have been found to carry recE homologues in similar positions in their early left operons (McClelland et al., 2001), suggesting that the recE gene is most likely an authentic part of the original Rac phage. 
In addition, it is likely that the Rac repressor and integrase still function, as conjugational transfer induces gene expression from the prophage and causes excision (Evans et al., 1979; Feinstein and Low, 1982). Rac's right arm has not fared as well (Fig. 4); deletions have removed at least (i) the region between the DNA replication and lysis genes; (ii) the head and upstream tail genes (equivalent to λ genes nu1 to G-T); and (iii) the tail tip genes (equivalent to λ M to J). In addition, two transposons now reside in its right arm, one of which disrupts a homologue of the λ lom lysogenic conversion gene. There are four obvious pseudogenes in the right arm, the interrupted lom gene and truncated b1361, tail tapemeasure (H) and lysis (Rz) genes. Of course, it is not possible to tell whether any open reading frame that has not been studied experimentally but is approximately full length relative to other homologues is in fact functional, so this is the minimum number of defective genes.

Figure 4.

Three λ-like E. coli Rac prophages. Prophages Rac (E. coli strain K-12), Sp10 (strain Sakai) and CP-933R (strain EDL933) are located at identical positions in the three genomes. Genes and predicted genes are indicated by rectangles; black, genes outside the prophage; white, prophage genes that are transcribed to the left; grey, prophage genes that are transcribed to the right; cross-hatched, genes that currently have no homologues in other phages or prophages and so could in theory have been inserted since the original phage genome integrated at this site (see text). Below is a scale in kbp and arrows that indicate the major operons of the prophages as predicted by homology with other better characterized lambdoid phages. Cross-hatching between the three prophages marks regions of nucleotide sequence similarity; in some sections, the percentage identity is given. The labels for the various genes indicate known function or putative function as deduced from homology relationships. Open circles indicate apparent pseudogenes that have obviously been inactivated by mutation; closed circles indicate genes that have been shown to be functional in Rac; and closed triangles indicate deletions relative to known lambdoid infectious phages.

Curiously, immediately to the right of Rac's Rz homologue, the trkG gene for potassium uptake (Dosch et al., 1991; Schlosser et al., 1991) lies in a region that is very variable among the lambdoid phages and is not known to carry essential genes (for the phage). Was the trkG gene part of the original prophage or was it moved into this location subsequent to the phage's original integration? To date, no functional phage is known that carries a trkG homologue. The huge diversity of phages makes it difficult to even guess whether such a putative prophage gene, which has not yet been found on other phages, was or was not part of the phage that integrated to form the original prophage. The trkG gene in Rac and the argF homologue in CP4-6 (above) are such cases in point, but both are redundant to other genes with the same function in K-12 and so may be recent arrivals. Our (admittedly not exhaustive) analysis of the prophages in Table S1 suggests that there are few compelling examples of putative non-phage genes that have moved into a prophage after its integration. It seems inevitable that some non-phage genes would end up inside defective prophages during rearrangements that might accompany the decay process, and the frequency of such events could vary among hosts but, nevertheless, such events appear to be rare in prophages that have not yet decayed into unrecognizability.

More recently, the genomes of two closely related O157-type E. coli strains, EDL933 and Sakai, have been sequenced (Hayashi et al., 2001a; Perna et al., 2001) that carry a prophage located precisely at the Rac attachment site (Table 1); in EDL933, it was named CP-933R and, in Sakai, it was named Sp10 (Fig. 4). In a fourth E. coli strain, CFT073 (Welch et al., 2002), all that remains at this attachment site is 320 bp (including a C-terminal fragment of an integrase gene) that are 98.4% identical to the left end of the above three prophages. It thus appears that a related prophage once occupied the Rac attachment site in CFT073 but, as it has been nearly completely deleted, it will not be discussed further here. CP-933R and Sp10 are similar to one another, but are not identical. Both have lengths similar to known lambdoid phages (which range from about 39 kbp to 62 kbp). They are typically mosaic lambdoid genomes, with many homologues of known lambdoid phage genes arranged with the ‘correct’ clustering, order and orientation. Neither contains any genes that are clearly related to ‘non-phage’ genes, and both contain a few obvious pseudogenes. Among the essential virion assembly genes, the Sp10 putative coat protein gene contains a frameshift relative to several other prophages in these strains. CP-933R has head and tail genes that are similar to phage λ and, using λ gene nomenclature, its essential genes E, V, H, I and J are truncated or contain frameshifting mutations, and genes FI, FII, Z and U are missing. Thus, neither prophage is expected to be able to produce viable virions upon induction, and they appear to have had different mutational histories since their arrival at this location. As in Rac, the left arm of these two prophages appears, at this level of analysis, to be largely intact. The leftmost 21 kbp are > 99.9% identical in CP-933R and Sp10, and their leftmost 8 kbp are 99.0% identical to the K-12 Rac prophage.

Are Rac, CP-933R and Sp10 the result of integration by different phages at the same bacterial attachment site, or are they descendants of the same progenitor prophage? Independently isolated phages with identical integration specificities are known so, at first glance, the former scenario seems plausible, as the central regions of Sp10 and CP-933R are not closely related. The head genes of Sp10 are very similar to those of prophages Sp6, Sp9 and Sp12 (which are not close relatives of any experimentally studied phage). Those of CP-933R are very similar to head genes of phages λ and 21 and also closely related to genes in the CP-933Od portion of the complex CP-933O prophage (see Table S1). (‘Sp’ prophages are in E. coli strain Sakai and ‘CP’ prophages are in strain EDL933.) This could be interpreted as evidence for independent origins for Sp10 and CP-933R; however, the lambdoid phages are so diverse that very rarely, if ever, have any two independently isolated infectious lambdoid phages been found that are so nearly identical over such an extended region as Sp10 and CP-933R are at their left and right ends. HK97, 434 and λ integrate at the same site, as do Sf6 and HK620, and these phages do not have this ‘similar prophage ends with different central regions’ relationship; they are typically mosaically related with nearly identical integrase genes (Juhala et al., 2000; S. Casjens, A. J. Clark, W. Inwood and R. Moreno, unpublished). Thus, independent integration at the Rac attachment site by two different progenitor phages with such similar genomes seems an unlikely event.

The deletion in CP-933R that has end-points in its λE and V gene homologues (Fig. 4) contributes to a stronger argument that genes have, in fact, been exchanged among prophages within these bacteria. This deletion is also present with exactly the same end-points (between genes Z2136 and Z2137) in the EDL933 prophage CP-933Od. It is very unlikely that identical deletions happened independently in CP-933R and CP-933Od, so one of these head regions was apparently replaced by a copy of the other after the deletion occurred. It is also unlikely that this deletion (which removes six essential genes) would be present in an infecting phage virion's DNA. We cannot be absolutely sure, but it therefore seems most reasonable to propose that CP-933R and Sp10 are in fact descendants of the same original prophage, and that either (i) in EDL933, the head genes of the original prophage at this site were replaced by a copy of the deletion-carrying head genes from CP-933Od; or (ii) in Sakai, the phage λ-like head genes of the original prophage were replaced by a copy of those from Sp6, Sp9 or Sp12. Although such recombination acts could be seen as ‘homogenizing’, the recipient carries a new overall combination of alleles not present in the parent prophages. As there is a very low probability of two phage DNAs of independent origin having such extended regions of nearly identical nucleotide sequence integrating into the same chromosome, such identity, when present, could conceivably constitute tentative evidence for such duplicative exchanges. For example, the 14317 bp of identity between prophages XfP3 and XfP4 in X. fastidiosa 9a5c and the over 4000 bp of identity between the Gifsy-1 and Gifsy-2 prophages’ DNA replication–Nin regions in S. enterica LT2 suggest that such exchanges may also have occurred in these cases.

Even more surprising is the observation that the same type of relationship as is seen between CP-933R and Sp10 (extremely similar outside regions with very different central regions) is found to be common when other cognate prophage pairs in EDL933 and Sakai are compared. Prophage pairs Sp14/CP-933U, Sp4/CP-933M and Sp15/CP-933V all have this type of relationship (Fig. 5). For example, lambdoid prophages Sp14 and CP-933U are both integrated into the same site within the serU tRNA gene. These two prophages have about 12 kbp of 99.2% identity at their tail gene ends and 16 kbp of 99.9% identity at their integrase ends. Between these long terminal similarities, they have > 10 kbp of sequence where little similarity can be found. This central part of Sp14 contains an 8 kbp section of the head genes that is 99.8% identical to the head gene region of Sp4. If it is unlikely that two phage genomes with this type of sequence relationship happened to have integrated independently at the Rac attachment site in these two strains (above), then it is all the more unlikely that these four prophage pairs would also have such a relationship. Furthermore, the identical deletion of DNA between the coat and tail shaft genes that is present in CP-933R and CP-933Od in EDL933 is present in Sp4 and Sp14 in Sakai. As none of these four is a cognate prophage (i.e. integrated at the same site in these two strains), this deletion appears to have occurred in a common ancestor of EDL933 and Sakai and then moved between prophages several times after their divergence. The relative abundance of this type of duplicative rearrangement between prophages in these two isolates suggests that interprophage homologous recombination may occur much more frequently than previously imagined, and that such events could well be an important route by which new temperate phage allele combinations are formed.

Figure 5.

Central region shuffling among E. coli O157 prophages.
Top. Five E. coli O157 Sakai prophages are indicated by coloured rectangles. Bottom. The E. coli O157 EDL933 prophages integrated at cognate sites are similarly indicated. The host gene at the site of integration is shown between cognate prophages. All five cognate pairs have outer regions that are extremely similar (in most cases > 99% identical). The colours of the central sections of the prophages indicate their sequence relationships in the head gene regions, and the asterisk (*) indicates the presence of the deletion that ends in the coat and tail shaft protein genes (see text). Similar colours indicate nucleotide sequences that are > 93% identical. The central (head) regions indicated by different colours are not close sequence relatives; the closest is about two-thirds of the Sp15 head region, which is about 75% identical to that of CP-933U, and the others are much more distantly related. Rectangle sizes are not proportional to DNA length, and the situation is actually more complex than the diagram indicates in that some of the non-head gene regions of the central non-homologous parts of cognate prophages have different relationships from the indicated head genes. Prophage CP-933X contains the remaining unsequenced section of the strain EDL933 genome.

Pnm2 and Pnm3 prophages

Neisseria meningitidis cognate prophages Pnm2 in strain Z2491 and NeisMu1 in strain MC58 (Parkhill et al., 2000; Tettelin et al., 2000) are mosaic relatives of the Mu-like group of phages [E. coli phage Mu and three largely intact prophages, FluMu, Sp18 and Pnm1, present in H. influenzae Rd, E. coli Sakai and N. meningitidis Z2491, respectively, have been completely sequenced (Fleischmann et al., 1995; Parkhill et al., 2000; Hayashi et al., 2001a; Morgan et al., 2002); ‘NeisMu1’ is a provisional name used here as the original annotators did not name this element]. This type of phage integrates essentially randomly by a transposition mechanism (reviewed by Harshey, 1988). Thus, as the number of potential integration targets in any genome is huge, natural prophages of this type that are found at identical positions in the genomes of two independently isolated bacteria are extremely likely to be descendants of the same past phage integration event. Pnm2 and NeisMu1 occupy precisely the same integration site within an ABC-type transporter gene (Fig. 6). In both prophages, the reading frames of the two transporter gene halves seem to be essentially intact (97.4% identical in nucleotide sequence); however, the N-terminal fragment of the strain MC58 gene contains a frameshift mutation. These prophages are both certainly defective, and their deletion histories are different. For example, Pnm2 appears to have suffered an ≈ 9 kbp deletion in the tail region, and NeisMu1 has a major deletion in its middle gene region and a shorter deletion of the putative coat protein gene.

Figure 6.

Defective Mu-like Neisseria meningitidis prophages. Defective prophages Pnm2 and Pnm3 in strain Z2491 and NeisMu1 and NeisMu2 in strain MC58 are shown as in Fig. 4. Below each prophage, selected genes are marked by the gene number of the homologous phage Mu gene and/or a predicted function (Morgan et al., 2002). Grey arrows connect genes that have similar predicted function but not sequence similarity. Black bars marked A, B or C denote regions where more detailed comparisons were made (see text).

As in the case of Sp10 and CP-933R above, differential DNA replacements appear to have occurred after integration. An example of such a replacement is near the left end of the two prophages, where Pnm2 and NeisMu1 have unrelated genes, the best matches of which are other transcriptional repressors, at the position where other Mu-like phages encode repressors. In general, the homologous genes in Pnm2 and NeisMu1 are about as different from each other as are chromosomal backbone genes in N. meningitidis MC58 and Z2491. Sections A and B (Fig. 6) of Pnm2 and NeisMu1 are 99.3% and 97.4% identical in nucleotide sequence, respectively, and the three intact chromosomal genes adjacent to the left end of NeisMu1 in MC58 are 97% identical to the same genes in strain Z2491. This is consistent with the notion that NeisMu1 and Pnm2 have been diverging for about the same length of time as the chromosomes in which they reside. Sections A and B are in the head and tail gene clusters, respectively, neither of which should be under selection for function in the prophage. N. meningitidis is a naturally competent bacterium, in which DNA uptake is mediated through a DNA uptake sequence (Goodman and Scocca, 1988). This sequence is greatly over-represented in both Pnm2 and NeisMu1, so it is impossible to know whether one of the putative repressor genes entered the prophage from an infecting phage, another prophage (now gone) or from transforming phage or prophage DNA.

Also present at identical locations in N. meningitidis Z2491 and MC58 is another region that is probably a more highly decayed Mu-like prophage called Pnm3 and NeisMu2 in the two strains respectively (Fig. 6). These are much more highly deleted than Pnm2 and NeisMu1 (they retain only 15–20% of their putative original DNA), and so are likely to have been decaying for a longer period of time – yet they are 97.5% identical to each other in region C (Fig. 6). This is consistent with the divergence of Z2491 and MC58 after this element started to decay. These two prophages highlight the use of gene order and clustering in recognizing highly deleted prophages. The only match to an authentic phage gene in Pnm3 and NeisMu2 is the presence of a homologue of Mu gene 16 (also called gemA). The presence of a single phage-like gene, especially a regulatory gene (Ghelardini et al., 1994) such as this one, cannot be considered unequivocal evidence of a prophage. However, there are a number of genes in Pnm3/NeisMu2 that are similar to otherwise novel open reading frames present in Pnm2/NeisMu1 (and Pnm1, another largely intact Mu-like prophage in Z2491; Klee et al., 2000). As homologues to these genes are not present outside these prophages, and as they are present in the same order in each of the putative prophages, it can be rather firmly concluded that Pnm3 and NeisMu2 are real but highly deleted prophages.
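The gene order argument used above for Pnm3 and NeisMu2 can be made quantitative in a simple way: list the gene families found in a putative remnant and in a better-preserved prophage, and ask what fraction of the shared families occur in the same relative order. The sketch below is one possible way to do this, using a longest common subsequence; it assumes that each gene has already been assigned a family label by some homology search, and all labels other than gemA are hypothetical placeholders rather than real annotations.

```python
def longest_common_subsequence(a, b):
    """Return one longest common subsequence of two lists (standard dynamic programme)."""
    n, m = len(a), len(b)
    # dp[i][j] = LCS length of a[i:] and b[j:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    out, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return out


def collinearity(region_a, region_b):
    """Fraction of shared gene families that lie in the same relative order (0.0 to 1.0)."""
    shared = set(region_a) & set(region_b)
    if not shared:
        return 0.0
    in_a = [g for g in region_a if g in shared]
    in_b = [g for g in region_b if g in shared]
    conserved = longest_common_subsequence(in_a, in_b)
    return len(conserved) / min(len(in_a), len(in_b))


# Hypothetical family labels, in map order, for a largely intact prophage and two remnants.
intact = ["rep", "orfA", "gemA", "orfB", "orfC", "portal", "orfD"]
remnant_same_order = ["orfA", "gemA", "orfC", "orfD"]
remnant_scrambled = ["orfD", "gemA", "orfA", "orfC"]

score_same = collinearity(intact, remnant_same_order)
score_scrambled = collinearity(intact, remnant_scrambled)
```

A score near 1 for a cluster of otherwise novel genes supports the prophage remnant interpretation; a low score does not exclude it, as rearrangements and gene orientation are ignored in this simplified measure.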

The complex decay of prophages

It might have been expected that derelict prophage DNAs would be in a straightforward mutational ‘free fall’ in which inactivating mutations occur at random until the prophage is completely eliminated. Lysogenic conversion (or possibly other) integrated prophage genes may be advantageous to the host and be kept functional by selection as the rest of the prophage decays into oblivion, and so they may eventually be appropriated as integral parts of the host chromosome. Examples of possible intermediates in this assimilation process might be some pathogenicity islands and the Shigella dysenteriae Shiga toxin that is encoded by a small prophage remnant (McDonough and Butterton, 1999). Likewise, plasmid prophages might evolve into plasmid replicons.

However, the situation is certainly much more complex than this. Understanding prophage evolution and decay is significantly complicated by possible excision and subsequent replacement by another, possibly related, phage genome, as well as by homologous recombination with infecting phage genomes and other prophages in the same cell. Infecting phages can clearly acquire genetic information from prophages in cells they infect (e.g. Kaiser, 1980; Espion et al., 1983; Bouchard and Moineau, 2000). However, as prophages express immunity and superinfection exclusion systems that can allow cell survival after a superinfecting phage has injected its DNA (Susskind et al., 1974; Susskind and Botstein, 1980), transfer of information from infecting phage DNA to related prophages might occur as well. Studies of the DNA sequences present in different E. coli isolates at the phage 21, λ and Atlas attachment sites have found evidence for different entities being present at each site in some different strains (Milkman and Bridges, 1990; Wang et al., 1997; Kuhn and Campbell, 2001), so complete excision and replacement is certainly plausible. Although such findings could be interpreted to support the idea that, in a given bacterial lineage, prophages come and go (perhaps frequently; Campbell, 1996), the arguments presented above suggest that complete replacement may be less common than other types of genetic interactions. Some prophages may in fact spend rather long times in residence in bacterial chromosomes before being completely removed. At least large parts of Proteobacterial prophages such as Rac and Pnm2, which are still quite far from complete assimilation, appear to have been in residence at least long enough for bacterial genes to diverge about 1.5 and 3% respectively. (The genes of E. coli K-12, for example, are on average 98.5% and 98.3% identical to those of EDL933 and Sakai respectively; Perna et al., 2002.) If estimates of divergence rates in bacteria are correct (Ochman and Wilson, 1987; Reid et al., 2000), this suggests that these prophages may have been in place for as long as several million years. This does not mean that all prophages have such long residence times, and the suggestion of such antiquity for some is rather speculative. Comparison of additional genome sequences will help to decide upon its accuracy.
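The arithmetic behind such a residence time estimate is simple. If two lineages have accumulated a fractional divergence d since their common ancestor, and each lineage changes at a rate r (substitutions per site per year), then the elapsed time is roughly t = d / (2r). The sketch below illustrates only this calculation; the rate used in the example is a placeholder chosen for illustration and is not a value taken from the studies cited above, and no correction for multiple substitutions at the same site is applied (such a correction is small at divergences of a few per cent).

```python
def divergence_from_identity(percent_identical: float) -> float:
    """Convert percent identity (e.g. 98.5) into fractional divergence (e.g. 0.015)."""
    if not 0.0 <= percent_identical <= 100.0:
        raise ValueError("percent identity must lie between 0 and 100")
    return 1.0 - percent_identical / 100.0


def residence_time_years(divergence: float, rate_per_site_per_year: float) -> float:
    """Time since two lineages split, t = d / (2r); both lineages accumulate changes."""
    if not 0.0 <= divergence < 1.0:
        raise ValueError("divergence must be a fraction in [0, 1)")
    if rate_per_site_per_year <= 0.0:
        raise ValueError("rate must be positive")
    return divergence / (2.0 * rate_per_site_per_year)


# ASSUMPTION: a purely illustrative per-lineage rate, not a measured value.
ILLUSTRATIVE_RATE = 1e-9

d_host_genes = divergence_from_identity(98.5)  # the ~1.5% divergence quoted in the text
estimate = residence_time_years(d_host_genes, ILLUSTRATIVE_RATE)
```

Because the estimate scales inversely with the assumed rate, a twofold uncertainty in r translates directly into a twofold uncertainty in the inferred residence time, which is one reason why the antiquity suggested here should be regarded as tentative.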

Analysis of such decaying prophages clearly shows that point mutations, transposon insertions and deletions all occur. Interestingly, it appears that, as prophages decay, prophage-debilitating deletions can accumulate more rapidly than gene-inactivating point mutations, as numerous genes, even in moderately highly deleted prophages such as Rac, remain functionally intact. The functionality of many normally unexpressed genes in defective prophages has been demonstrated in the laboratory through mutations that turn on their expression (Willis et al., 1985; Blasband et al., 1986; Bejar et al., 1988; Mahajan et al., 1990), recombination onto a related phage that depends upon that function (Kaiser, 1980; Espion et al., 1983; Bouchard and Moineau, 2000) or expression of functional proteins from a cloning vector (Morimyo et al., 1992; King and Murray, 1995; Jin et al., 1996; Mahdi et al., 1996). This lack of debilitating point mutations could be the result of random failure to be inactivated by mutation or be due to selection for function if the genes are in fact weakly expressed and have a function in the lysogen. The latter seems unlikely, except for lysogenic conversion genes, given current knowledge about gene expression from prophages. On the other hand, inactivated genes could be repaired to full functionality by recombination with other prophages or with infecting phages.

It has long been known that homologous recombination between lambdoid prophages is possible in the laboratory (Meselson, 1967; Redfield and Campbell, 1987), but simple, single break-and-join recombination events between non-tandem prophages integrated in the same chromosome would result in inversion or deletion of the intervening DNA. Such events could be detrimental to the host; however, one such inversion event does appear to have occurred that involved prophages CP-933O and CP-933P in E. coli strain EDL933 (Perna et al., 2002). Non-reciprocal double break-and-join or long gene conversion events could replace parts of one prophage with sequences from another prophage. Either mechanism could create relationships such as those observed between the prophages in Fig. 5 where multiple prophages within a bacterium contain sections of nearly identical sequence. Such duplicative replacement events among prophages should not distinguish between functional and non-functional genes, and so would be just as likely to replace a functional gene with a non-functional one as vice versa. On the other hand, replacement of part of a prophage by part of an infecting phage genome would be more likely to repair damaged prophage genes to functionality, as genes on an infectious phage genome have presumably been under recent selection for functionality. Nonetheless, at present, we cannot know whether (for example) the apparently functional left early operon of Rac was left intact by chance, was somehow selected to remain functional or is currently functional because of recent repair from another prophage (since lost) or an infecting phage. We can, however, conclude that, even if a prophage is defective, it is not necessary that all its genes are doomed to be lost forever. As many of their genes retain functionality and remain accessible to the phage population, and as phage virions may only be in 10-fold excess over bacterial cells in the environment (Bergh et al., 1989), prophage genes constitute a significant portion of the ‘phage gene pool’ in the earth's biosphere.
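The two outcomes of a single break-and-join event mentioned above depend on the relative orientation of the recombining prophages: a crossover between homologous sequences in the same orientation deletes the intervening DNA, whereas a crossover between sequences in opposite orientation inverts it. The toy model below illustrates this on a chromosome represented as an ordered list of (name, strand) elements. It is a deliberately simplified sketch: each prophage is treated as a single indivisible unit, the hybrid nature of the recombinant prophages is not modelled, and all element names are hypothetical.

```python
def single_crossover(chromosome, i, j):
    """Outcome of one crossover between homologous elements at indices i < j.

    chromosome is a list of (name, strand) tuples with strand +1 or -1.
    Returns (new_chromosome, event), where event is 'deletion' or 'inversion'.
    """
    if not 0 <= i < j < len(chromosome):
        raise ValueError("need 0 <= i < j < len(chromosome)")
    strand_i = chromosome[i][1]
    strand_j = chromosome[j][1]
    if strand_i == strand_j:
        # Same orientation: the intervening DNA and one prophage copy are excised.
        return chromosome[: i + 1] + chromosome[j + 1 :], "deletion"
    # Opposite orientation: the intervening DNA is reversed and every element flips strand.
    middle = [(name, -strand) for name, strand in reversed(chromosome[i + 1 : j])]
    return chromosome[: i + 1] + middle + chromosome[j:], "inversion"


# Hypothetical chromosome with two related prophages, P1 and P2, flanking two host genes.
direct = [("geneA", 1), ("P1", 1), ("geneB", 1), ("geneC", 1), ("P2", 1), ("geneD", 1)]
inverted = [("geneA", 1), ("P1", 1), ("geneB", 1), ("geneC", 1), ("P2", -1), ("geneD", 1)]

after_direct, event_direct = single_crossover(direct, 1, 4)
after_inverted, event_inverted = single_crossover(inverted, 1, 4)
```

In this representation, the deletion outcome removes geneB and geneC along with one prophage copy, which is why such events between distant prophages in the same orientation are expected to be detrimental whenever essential host genes lie between them.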

Comments on the identification and annotation of prophages

In order to understand fully the true nature of bacterial genomes, we must be able to recognize prophages in nucleotide sequence; however, the extreme variability of phage nucleotide sequences makes it quite possible that unrecognized prophages still lurk in bacterial genome sequences. The ‘gold standard’ of prophage recognition is and should remain high similarity of sequence and gene organization to authentic temperate phages that infect the same bacterial species. In addition, (i) recognition of the conserved nature of some dsDNA tailed phage morphogenetic proteins such as portal and terminase; and (ii) the observation that these proteins do not have homologues with known non-phage functions have made the recognition of many prophages, even in distantly related bacterial genome sequences, quite unambiguous.

Can our ability to recognize prophages and annotate their sequences be improved? Yes. Most importantly, the study of additional infectious tailed phages, especially those that infect the less well-studied phylogenetic branches of bacteria, will help to ‘fill in the current gaps in sequence space’ and so make prophages more easily recognizable in those phyla. Hopefully, this will eventually lead to a situation in which at least the most highly conserved phage proteins will form one or a few (transitive) sets of related sequences that will include, for example, all the known terminases or portal proteins and will contain recognizable homologues of all subsequently sequenced members of those families. But such cornerstone genes may not be present in authentic but defective prophages or satellite prophages; how can we recognize these with higher accuracy and confidence? Several simple things can be done now.

  • (i) As relatively few ‘non-phage’ genes appear to have moved into the known prophages after integration, it seems justified at this point for bacterial genome annotators to indicate that ‘hypothetical’ (novel) or ‘conserved hypothetical’ (have a homologue of unknown function in the database) genes within apparent prophages are ‘putative prophage genes’. To date, some bacterial genomes have been annotated in this manner, whereas many have not. If this were universally done, it would be much easier to determine whether conserved hypothetical genes in new prophages are present elsewhere in other prophages. If they are, and especially if they are present in the same order, they probably represent the remains of another prophage. When this logic is applied to the X. fastidiosa 9a5c genome, for example, at least five more highly deleted putative prophage remnants are found in addition to the four prophages identified in the original genome report (Simpson et al., 2000) (Table S1).
  • (ii) Many phage genes have specific, very well-understood functions, so possible prophage genes with homology to these should be annotated with their specific presumed function, not just ‘phage-related protein’ as is often done currently.
  • (iii) Prophages very often ‘repair’ the bacterial gene into which they integrate by carrying a similar replacement part on the phage genome that gets fused to the target gene upon integration (Campbell et al., 1992; Campbell, 1994). Even when this does not occur, the identity between the phage and bacterial attachment sites is usually ≥ 10 bp long. Thus, there are typically exact direct repeats tens of basepairs long at the prophage boundaries [e.g. 10–148 bp in the various E. coli Sakai prophages (Hayashi et al., 2001b); such repeats can be even longer and need not be perfect throughout their length (Campbell et al., 1992)]. Genome annotators and analysers should attempt to locate and report such repeats, as finding these features identifies the outside boundaries of the prophage with precision.
  • (iv) A strong argument for integrated phage DNA (or any mobile DNA element) is its absence in some other strains. This may be subject to exception if there has been a recent population bottleneck in a species or if phages are so abundant that ancestors of every extant bacterium acquired a prophage at a given attachment site. As many genome sequencing operations have an interest in using their sequence information to examine genomic variation within species, this author recommends that, whenever possible, if a prophage is tentatively identified in a new bacterial genome sequence, the sequencers check for its absence in other strains. This can be done by DNA array analysis, but this approach has the disadvantage that it can be fooled by the not unlikely occurrence of similar prophages at different locations in other strains. A more informative approach is polymerase chain reaction amplification across the putative attachment site in other strains and sequencing the amplified product, if it is made, that is expected when no prophage is present. This would both help to confirm a sequence region as a prophage and precisely locate the prophage attachment site and prophage ends with confidence, which in turn would make the assignment of prophage genes much more robust.
  • (v) Finally, annotators should give names to putative prophage elements in bacterial genome sequences. This may seem a trivial point, but it has not been done in many of the published genome sequences, and the lack of names makes it difficult for others (who are reluctant to name them themselves) to deal with them in print. Prophages at cognate sites in different strains of the same species should not be given the same name, as they are probably not identical.

If these suggestions were universally implemented, global analysis of prophage sequences would be much easier, which in turn would make annotation much more accurate and our understanding much more sophisticated.
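Suggestion (iii) above lends itself to a simple computational check. Given approximate left and right boundaries for a putative prophage, one can search a window around each boundary for the longest exact sequence shared by the two windows and report it as a candidate attachment site repeat. The following is a minimal sketch of that idea, not an established tool: the function names, the window size and the default minimum length of 10 bp (chosen to echo the figure quoted in point iii) are assumptions, and only perfect direct repeats are found, whereas real repeats need not be perfect throughout their length.

```python
def longest_shared_substring(left: str, right: str):
    """Return (length, start in left, start in right) of the longest exact common substring."""
    best = (0, 0, 0)
    prev = [0] * (len(right) + 1)
    for i in range(1, len(left) + 1):
        cur = [0] * (len(right) + 1)
        for j in range(1, len(right) + 1):
            if left[i - 1] == right[j - 1]:
                cur[j] = prev[j - 1] + 1  # extend the match ending at i-1, j-1
                if cur[j] > best[0]:
                    best = (cur[j], i - cur[j], j - cur[j])
        prev = cur
    return best


def boundary_repeat(genome: str, left_guess: int, right_guess: int,
                    flank: int = 200, min_len: int = 10):
    """Find a candidate direct repeat near two approximate prophage boundaries.

    Returns (repeat sequence, left copy start, right copy start) or None.
    Coordinates are 0-based positions in the genome string.
    """
    left_start = max(0, left_guess - flank)
    left_end = min(len(genome), left_guess + flank)
    right_start = max(0, right_guess - flank)
    right_end = min(len(genome), right_guess + flank)
    if left_end > right_start:
        raise ValueError("search windows overlap; reduce flank or check the boundaries")
    length, i, j = longest_shared_substring(genome[left_start:left_end],
                                            genome[right_start:right_end])
    if length < min_len:
        return None
    repeat = genome[left_start + i : left_start + i + length]
    return repeat, left_start + i, right_start + j


# Toy genome (made-up sequence): host DNA, repeat, 'prophage', repeat, host DNA.
att = "TTGACCGGTAACGT"
toy_genome = "A" * 30 + att + "C" * 100 + att + "G" * 30
found = boundary_repeat(toy_genome, left_guess=35, right_guess=150, flank=20)
```

Any repeat reported in this way is only a candidate; short repeats can arise by chance, so confirmation by comparison with a strain that lacks the prophage, as recommended in point (iv), remains the more convincing evidence.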

Prophage sequences and bacteriophage diversity

For those who are interested in understanding the range of diversity of phages on earth, the sequenced prophages represent a wealth of information that cannot be ignored, as more prophage sequences have been determined than have sequences of bona fide infectious phages. For example, it is currently possible to use the prophage sequences to learn about the different types of non-homologous (convergent) gene modules that are used for a particular function by a group of temperate phages. A few such cases are as follows. (i) The E. coli RecE-type homologous recombination function was first found in the defective Rac prophage and only subsequently found in other infectious lambdoid phages (above). This recombination system uses the recE and recT genes of prophage Rac but, on other phages, we can recognize only a recT homologue [e.g. B. subtilis phage SPP1 (Alonso et al., 1997) and L. monocytogenes phage A118 (Loessner et al., 2000)] or only a recE homologue (e.g. S. enterica phages Gifsy-1 and Gifsy-2; McClelland et al., 2001). This suggests that these phages may have another non-homologous protein that replaces the missing partner. (ii) The lambdoid phages P22 and λ have convergent replication genes – the λ gene P protein recruits the host DnaB helicase to the replication initiation complex, whereas the cognate, non-homologous P22 gene 12 protein is a homologue of the host DnaB protein that does the helicase job itself (Wickner, 1984a,b). An E. coli dnaC homologue was first seen in the Rac prophage at this location and, recently, the lambdoid phages Gifsy-1 and Gifsy-2 have been found to carry a clear dnaC homologue in their DNA replication region. E. coli DnaC protein is a helicase loader, so perhaps these DnaC homologues act in the same way as phage λ P protein? In addition, the lambdoid S. enterica LT2 prophage Fels-1 encodes a novel protein that contains a primase motif in its replication gene position; does Fels-1 use a new, unstudied type of lambdoid phage replication initiation? (iii) Most lambdoid phages carry a homologue of the λ Rz lysis gene adjacent to and downstream of their endolysin gene. The K-12 prophage QIN has a different, novel gene in this position, and phage N15 was subsequently found to have a homologue of this QIN protein in the same location (Ravin et al., 2000). Does this gene represent a functional alternative to Rz function? (iv) Several prophages in the E. coli genome sequences that are lambdoid in other respects have head and/or tail genes (as deduced from their position within the prophage) that are unrelated in sequence to any previously studied virion assembly genes, and lambdoid prophages found in the genomes of Wolbachia (Masui et al., 2000; 2001) and X. fastidiosa (Simpson et al., 2000) have tail genes that are homologous to genes that encode contractile tails in other phages (all previously characterized lambdoid phages had non-contractile or short tails). More recently, E. coli and S. flexneri lambdoid phages φP27 and SfV were found to have contractile tail genes (Allison et al., 2002; Recktenwald and Schmidt, 2002). Clearly, the sequenced prophages are an excellent place to find variations on temperate phage lifestyle themes.

Finally, we can learn about the overall variety of types of temperate phages from the examination of prophage sequences. A dramatic example of this may be indicated by genes homologous to the RNA polymerase gene of virulent E. coli phage T7 in the X. axonopodis 903 genome (da Silva et al., 2002) and in the Pseudomonas putida KT2440 genome (Nelson et al., 2002). In both cases, homologues of phage head and tail genes lie nearby, supporting the notion that these putative RNA polymerase genes are parts of prophages PP03 in P. putida and XacP2 in X. axonopodis (Table S1, Supplementary material). If true, this would be a completely new type of temperate phage, as no temperate phage is currently known to encode its own RNA polymerase. Many such discoveries no doubt await the careful analysis of the numerous prophages present in bacterial genome sequences.


The author's research is supported by NSF grant MCB990526 and NIH grant AI49003. I thank Roger Hendrix and Jeff Lawrence for reading this manuscript and for many productive discussions of phage biology and evolution, and Thad Stanton, Kenn Rudd, Guy Plunkett and Nicole Perna for access to unpublished information.

Supplementary material

The following material is available from

Table S1. Prophages and phage-like objects in 82 published bacterial complete genomes.