Whole genome comparison of six Crocosphaera watsonii strains with differing phenotypes


Author for correspondence: e-mail jpzehr@gmail.com.


Crocosphaera watsonii, a unicellular nitrogen-fixing cyanobacterium found in oligotrophic oceans, is important in marine carbon and nitrogen cycles. Isolates of C. watsonii can be separated into at least two phenotypes with environmentally important differences, indicating possibly distinct ecological roles and niches. To better understand the evolutionary history and variation in metabolic capabilities among strains and phenotypes, this study compared the genomes of six C. watsonii strains, three from each phenotypic group, which had been isolated over several decades from multiple ocean basins. While a substantial portion of each genome was nearly identical to sequences in the other strains, a few regions were identified as specific to each strain and phenotype, some of which help explain observed phenotypic features. Overall, the small-cell type strains had smaller genomes and a relative loss of genetic capabilities, while the large-cell type strains were characterized by larger genomes, some genetic redundancy, and potentially increased adaptations to iron and phosphorus limitation. As such, strains with shared phenotypes were evolutionarily more closely related than those with the opposite phenotype, regardless of isolation location or date. Unexpectedly, the genome of the type-strain for the species, C. watsonii WH8501, was quite unusual even among strains with a shared phenotype, indicating it may not be an ideal representative of the species. The genome sequences and analyses reported in this study will be important for future investigations of the proposed differences in adaptation of the two phenotypes to nutrient limitation, and to identify phenotype-specific distributions in natural Crocosphaera populations.

Crocosphaera watsonii is a unicellular nitrogen (N2)-fixing cyanobacterium that is widely distributed throughout tropical and subtropical oligotrophic oceans. In those regions, the low level of bioavailable nitrogen (N) often limits primary production, and N2-fixing (i.e., diazotrophic) phytoplankton can be an important source of N for the phytoplankton community. A variety of studies have demonstrated that unicellular diazotrophic cyanobacteria, especially Crocosphaera and UCYN-A, are abundant and contribute substantial amounts of N in many oligotrophic regions (Zehr et al. 2001, Falcon et al. 2004, Montoya et al. 2004, Church et al. 2005, 2008, Langlois et al. 2008, Kitajima et al. 2009, Moisander et al. 2010). Crocosphaera strains, all of which are the species C. watsonii, have been successfully isolated from multiple ocean basins and maintained in culture for many years. Although these strains exhibit phenotypic differences, genetic comparisons have found the vast majority of DNA sequences to be essentially identical among cultivated strains and environmental sequences (Zehr et al. 2007, Bench et al. 2011). In the context of such high levels of sequence conservation, C. watsonii strains appear to diverge and maintain genetic diversity through genetic rearrangements and by incorporating strain-specific sequences (Zehr et al. 2007, Bench et al. 2011). Large numbers of mobile genetic elements (i.e., transposase genes) in the C. watsonii WH8501 genome provide a possible mechanism for such genetic insertions, deletions, and rearrangements (Bench et al. 2011). Crocosphaera are distinguished by these characteristics from sympatric non-N2-fixing marine cyanobacteria, such as Synechococcus and Prochlorococcus, which generally lack transposase genes and have a high degree of genomic sequence diversity in cultured strains and environmental sequences (Coleman et al. 2006, Rusch et al. 2007, Zhao and Qin 2007, Dufresne et al. 2008, Scanlan et al. 2009, Partensky and Garczarek 2010).

Physiological studies of cultivated and natural populations of C. watsonii have identified a number of genetic strategies that appear to be adaptations to the oligotrophic environment. These include regulation of gene expression, nitrogen fixation rates, and cellular protein content in response to changes in nutrient (e.g., iron and phosphorus) levels and other environmental variables (Webb et al. 2001, Tuit et al. 2004, Falcon et al. 2005, Dyhrman and Haley 2006, Fu et al. 2008, Hewson et al. 2009, Compaoré and Stal 2010, Shi et al. 2010, Saito et al. 2011). Currently, cultivated C. watsonii strains can be divided into two broad phenotypic categories: (i) those that produce large amounts of exopolysaccharide (EPS) and have larger cell diameters (over 4 μm), and (ii) those that do not produce noticeable EPS and have cell diameters less than 4 μm (Webb et al. 2009, Sohm et al. 2011). The most apparent difference between the two phenotypes in culture is that the large-cell strains produce more than 10 times as much EPS as the small-cell strains (Sohm et al. 2011). While it is not well understood why Crocosphaera sp. produce EPS, studies on other species have shown that EPS production can have cell-protective properties (Pereira et al. 2009), and can also enhance carbon export from surface waters in the form of marine snow (Passow et al. 2001, Sohm et al. 2011). A recent genomic comparison of two Crocosphaera strains, one of each phenotype, identified a region in the large-cell type genome that is likely to play an important role in EPS production (Bench et al. 2011). This region contained 25 genes, many of which encoded functions related to EPS biosynthesis, and all of which were absent from the small-cell type genome. The two phenotype groups have additional, ecologically relevant differences in phosphorus scavenging gene content, growth temperature optima, per-cell nitrogen fixation rates, and photosynthetic efficiency (Dyhrman and Haley 2006, Webb et al. 2009, Sohm et al. 2011).

To better understand the genetic basis of the C. watsonii phenotypes, this study compared the genomes of six strains, three in each phenotypic group, isolated over decades from multiple ocean basins. This comparison included examining evolutionary relationships among strains and identifying genomic features and metabolic capabilities that are unique to strains and phenotypes.

Materials and Methods

Strain growth and genomic DNA isolation and sequencing

The phenotypes, isolation locations, and GenBank genome accession numbers for the C. watsonii strains described in this study are listed in Table 1, and all strains have been previously described (Waterbury et al. 1986, 1988, Webb et al. 2009). All strains were grown in nitrogen-free SO medium (Waterbury et al. 1986, 1988) in polycarbonate tissue culture flasks with a 0.2 μm pore-size vent cap (Corning Inc., Corning, NY, USA) at 26°C under a 12:12 h light/dark cycle. The genome of the WH8501 strain was sequenced by the Joint Genome Institute, and the resulting publicly available sequence (accession number in Table 1) was used for comparisons in this study. The genome of strain WH0003 was sequenced prior to this study, with detailed methods described in Bench et al. (2011). Briefly, a non-axenic culture was subjected to bead-beating to detach cells from their extracellular matrix (ECM), and cells were subsequently sorted using fluorescence-activated cell sorting. The genomic DNA from the sorted cells was amplified using the GenomiPhi V2 amplification kit (Amersham Biosciences, Piscataway, NJ, USA).

Table 1. Crocosphaera watsonii phenotypes and strain origins
Strain | Phenotype | Year isolated | Ocean basin where isolated | Location where isolated | Genome accessions
WH8501 | Small-cell | 1984 | S. Atlantic | 28°S, 48°W | AADV00000000.2
WH8502 | Small-cell | 1984 | S. Atlantic | 26°S, 42°W | CAQK01000001 - CAQK01000869
WH0401 | Small-cell | 2002 | N. equatorial Atlantic | 6°N, 49°W | CAQM01000001 - CAQM01000918
WH0003 | Large-cell | 2000 | N. Pacific (St. ALOHA) | 22°N, 158°W | AESD01000001 - AESD01001126
WH0005 | Large-cell | 2000 | N. Pacific (St. ALOHA) | 22°N, 158°W | CAQL01000001 - CAQL01001266
WH0402 | Large-cell | 2002 | S. equatorial Atlantic | 11°S, 32°W | CAQN01000001 - CAQN01001343

Genomic DNA for the four additional strains described in this study was obtained using the same methods as the WH0003 strain (Bench et al. 2011), with the following modifications: the WH8502 and WH0401 strains do not produce large amounts of ECM, so they were sorted without bead-beating, and instead of GenomiPhi, the amplification kit used for all four strains was the REPLI-g Midi kit (Qiagen, Valencia, CA, USA). The REPLI-g amplification method was based on the protocol provided by Qiagen for “small numbers of cells or single cells” (http://www.qiagen.com/products/genomicdnastabilizationpurification/replig/repligminimidikits.aspx#Tabs=t2). Specifically, sorted cells (5,000–10,000 cells for each strain) were spun at 14,000 rpm (20,800 g) for 40 min and the supernatant was discarded. Pelleted cells were resuspended in 2.5 μL of PBS followed by 3.5 μL of Buffer D2 (see Qiagen protocol above), and lysed in a 65°C water bath for 5 min. The lysed cells were placed on ice, and lysis was terminated by adding 3.5 μL of Stop Solution (provided with kit). Amplification was immediately carried out in 50 μL reactions, which contained the cell-lysis mix plus 1 μL of REPLI-g Midi DNA Polymerase, 29 μL of REPLI-g Midi reaction buffer, and 10 μL of RT-PCR grade H2O.

Prior to 454 sequencing, amplified genomic DNA was quantified using Pico Green (Invitrogen Corporation, Carlsbad, CA, USA). Using sorted cell amplified DNA, shotgun libraries for each strain were constructed and sequenced by the UCSC Genome Sequencing Center (http://biomedical.ucsc.edu/GenomeSequencing.html) on the Genome Sequencer FLX instrument using Titanium Series protocols according to the manufacturer's specifications (454 Life Sciences, Branford, CT, USA).

Genome assembly and annotation

For the four strains sequenced during this study, an average of 363,200 reads were generated per genome, with an average read length of 374 bp and ~135,900 kb of sequence data for each genome. The genome-wide average depth of coverage ranged from 23× to 30× for each strain, and while some variation in read depth was noted among contigs, the vast majority (86%–98%) of contigs in each genome had an average coverage of over 10 reads deep. The reads for each strain were assembled separately using Version 2.0.00 of the Newbler GS De Novo Assembler program (454 Life Sciences). The assembly was run via command line interface using the “-nrm”, “-consed”, and “-large” flags. All other parameters used were the default values, as described in the manufacturer's publication, “Genome Sequencer Data Analysis Software Manual”.
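The reported yield and depth of coverage can be cross-checked with simple arithmetic. The sketch below is illustrative only: it assumes every strain received the average read yield, and it uses the assembled genome lengths from Table 2 as a stand-in for true genome size.

```python
# Rough consistency check of sequencing yield and mean depth of coverage.
# Assumption: each strain received the average yield reported in the text.
reads_per_genome = 363_200
mean_read_length_bp = 374
total_bases = reads_per_genome * mean_read_length_bp  # about 135.8 Mb

# Assembled lengths (bp) from Table 2 for the four strains sequenced here.
assembled_lengths_bp = {
    "WH8502": 4_683_052,
    "WH0401": 4_551_017,
    "WH0005": 5_975_524,
    "WH0402": 5_880_358,
}


def mean_coverage(bases_sequenced, genome_length_bp):
    """Average number of reads covering each base of the genome."""
    return bases_sequenced / genome_length_bp


coverage = {
    strain: mean_coverage(total_bases, length)
    for strain, length in assembled_lengths_bp.items()
}
for strain, depth in coverage.items():
    print(f"{strain}: {depth:.1f}x")
```

Under these assumptions the estimates fall between roughly 23× and 30×, in line with the range stated above.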

Open reading frames (ORFs) in each of the contig sequences for all four draft genomes were identified and annotated using RAST (Aziz et al. 2008). A small number of contigs (two contigs, ~1.3 kb, from the WH0401 genome, and 18 contigs, ~14 kb, from the WH0005 genome) were removed from further analysis based on a lack of recognizable coding sequence and/or a lack of any homology to known cyanobacterial sequences. It should be noted that ORF identification and annotation are unavoidably difficult in multi-contig genomes. As such, a small fraction of coding sequences in these genomes may not have been properly identified, particularly at the ends of contigs. In addition to annotated ORFs, each genome contained a single rRNA operon and 39 tRNAs. The final number of bases and contigs in each genome, as well as the %G+C and number of annotated ORFs, are listed in Table 2. The genome sequences and annotations deposited at DDBJ/EMBL/GenBank are publicly available at http://www.ncbi.nlm.nih.gov/ using the accession numbers listed in Table 1.

Table 2. Genome sizes and gene content statistics for six Crocosphaera watsonii strains
Strain | Total genome length (bp) | Average %G+C | Number of ORFs | Number of transposase genes | Number (and %) of strain-specific ORFs (a) | Number of strain-specific transposase genes | Number of contigs
Small-cell strains
WH8501 | 6,238,156 | 37.1 | 5,958 | 1,211 | 229 (3.8%) | 71 | 320
WH8502 | 4,683,052 | 37.6 | 4,965 | 165 | 104 (2.1%) | 4 | 869
WH0401 | 4,551,017 | 37.7 | 4,997 | 166 | 132 (2.6%) | 4 | 918
Large-cell strains
WH0003 | 5,892,658 | 37.7 | 6,145 | 223 | 223 (3.6%) | 9 | 1,130
WH0005 | 5,975,524 | 37.6 | 5,919 | 204 | 167 (2.8%) | 10 | 1,266
WH0402 | 5,880,358 | 37.7 | 6,471 | 216 | 315 (4.9%) | 19 | 1,343

(a) An ORF was considered strain-specific if it had no BLAST similarity to ORFs in any other genomes at 95% ID over 70% of the ORF length.

Transposase genes were identified and assigned to IS families using the BLAST tool on the ISfinder website (http://www-is.biotoul.fr/is.html) with default parameters (Siguier et al. 2006). ORFs with protein BLAST (BLASTp) E-values of <10^-3 were annotated as transposases and assigned to the IS family of the top BLAST hit. In addition, a small number of ORFs (10–18 per genome) were annotated with the transposase function by RAST but did not have qualifying BLAST hits to the ISfinder database. These were included in all of the transposase analyses, such as genome counts and IS family tallies. ORFs without a qualifying ISfinder hit and with a RAST annotation lacking IS family information were listed as “unknown” in the IS family counts. A similar process was used to identify transposase genes in WH0003 and WH8501, with one additional pre-analysis step for the WH8501 genome, which involved identification and grouping of highly duplicated sequences (see methods in Bench et al. 2011).
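The annotation rule described above can be summarized as a small decision function. This is a minimal sketch rather than the software used in the study: the function name, the input layout, and the simplification that a RAST transposase annotation never carries IS family information are all assumptions made for illustration.

```python
E_VALUE_CUTOFF = 1e-3  # BLASTp E-value threshold stated in the text


def assign_is_family(top_isfinder_hit, rast_function):
    """Return an IS family label, "unknown", or None for one ORF.

    top_isfinder_hit: (is_family, e_value) for the best ISfinder hit, or None.
    rast_function: the RAST functional annotation string for the ORF.
    Hypothetical helper; assumes RAST text carries no IS family information.
    """
    if top_isfinder_hit is not None:
        family, e_value = top_isfinder_hit
        if e_value < E_VALUE_CUTOFF:
            return family  # qualifying hit: take the family of the top hit
    if "transposase" in rast_function.lower():
        return "unknown"  # transposase by RAST only, family not known
    return None  # not counted as a transposase


print(assign_is_family(("IS5", 1e-20), "hypothetical protein"))
print(assign_is_family(("IS5", 0.5), "Transposase"))
print(assign_is_family(None, "ABC transporter"))
```

The three calls illustrate the three outcomes: a qualifying ISfinder hit, a RAST-only transposase placed in the "unknown" family, and an ORF that is not counted as a transposase.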

Genome comparisons

Comparisons between all six genomes were based on nucleotide BLAST of ORFs. The ORFs from each genome were used as queries in BLAST comparisons against the other five genomes, and an ORF was classified as shared between genomes if it had >95% nucleotide identity over at least 70% of the ORF length. These criteria were based on the observation that the nucleotide sequences for shared ORFs were generally >99% identical, and that the number of matches fell off rapidly below 98% identity (Figure S1 in the Supporting Information). The same similarity criteria were used to cluster ORFs within a single genome into repeated sequence groups using the CD-HIT web server (Li and Godzik 2006, Huang et al. 2010). To assess similarities across all six genomes, six tables of BLAST results (one for each genome versus the other five genomes) were merged using custom software according to sequence similarity and binned by the genomes in which the sequence was present. From the original 34,455 ORFs in the six genomes, this process produced a non-redundant set of 11,635 sequences that represented all ORFs in the six genomes.
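The shared-ORF criterion can be expressed as a simple filter on BLAST hits. The sketch below is not the custom software used in the study. The function names and the example hit values are hypothetical, and each hit is assumed to be already reduced to its percent identity, aligned length, and query ORF length.

```python
def is_shared(percent_identity, aligned_length, orf_length,
              min_identity=95.0, min_fraction=0.70):
    """True if a hit has >95% identity over at least 70% of the ORF length."""
    if orf_length <= 0:
        raise ValueError("orf_length must be positive")
    aligned_fraction = aligned_length / orf_length
    return percent_identity > min_identity and aligned_fraction >= min_fraction


def genomes_containing(orf_hits):
    """List the genomes in which one query ORF is considered present.

    orf_hits maps genome name -> (percent_identity, aligned_length, orf_length)
    for the best hit of the query ORF in that genome.
    """
    return sorted(g for g, hit in orf_hits.items() if is_shared(*hit))


# Invented example hits for a single 900 bp query ORF.
example_hits = {
    "WH0005": (99.8, 900, 900),  # full-length, near-identical
    "WH0402": (99.5, 700, 900),  # 78% of the ORF length aligned
    "WH8502": (99.9, 400, 900),  # only 44% of the ORF length aligned
    "WH0401": (93.0, 900, 900),  # identity below the threshold
}
print(genomes_containing(example_hits))  # ['WH0005', 'WH0402']
```

Both thresholds must be met, so a near-identical but short alignment (for example, an ORF truncated at a contig end) is not counted as shared.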

Those 11,635 ORFs were grouped into 63 unique sharing patterns (the 2^6 − 1 non-empty combinations of six genomes) based on their observed presence or absence in the six genomes. The 63 patterns were further aggregated into categories according to the number of genomes where the ORF was present: there are six patterns for ORFs present in only one genome, 15 possible patterns for ORFs present in two genomes, etc. Equal similarity among the six genomes would result in random sharing and the expectation of equal ORF counts for each pattern within each category. For example, there were 1,727 sequences in the 15 patterns where ORFs were present in two genomes and, in the case of random sharing, the expected ORF count would be 115.1 (i.e., 1,727/15) in each. A chi-squared goodness-of-fit test was used to assess the statistical significance of differences between the observed and expected ORF counts in each pattern. A similar analysis was performed on ORF counts for patterns grouped according to strain phenotypes. For example, the 1,727 sequences shared by exactly two genomes were aggregated into three groups: two groups where the two genomes share a phenotype (both are large cell or both are small cell), and one group of ORFs found in one genome of each type. The random ORF sharing hypothesis was also tested for the grouped patterns using the chi-squared goodness-of-fit test.
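The pattern counts and the expected value under random sharing follow directly from combinatorics. The sketch below enumerates the patterns and computes a goodness-of-fit statistic against equal expected counts. It is an illustration using only the Python standard library, not the analysis code from the study. It stops at the test statistic and degrees of freedom, since a P value requires the chi-squared distribution from a statistics package.

```python
from itertools import combinations

STRAINS = ["WH8501", "WH8502", "WH0401", "WH0003", "WH0005", "WH0402"]

# All non-empty presence/absence patterns, grouped by number of genomes.
patterns_by_k = {k: list(combinations(STRAINS, k)) for k in range(1, 7)}
n_patterns = sum(len(p) for p in patterns_by_k.values())  # 63


def chi_squared_uniform(observed):
    """Goodness-of-fit statistic against equal expected counts.

    Returns (statistic, degrees_of_freedom, expected_count_per_pattern).
    """
    expected = sum(observed) / len(observed)
    statistic = sum((o - expected) ** 2 / expected for o in observed)
    return statistic, len(observed) - 1, expected


# Expected count per two-genome pattern under random sharing.
expected_two = 1727 / len(patterns_by_k[2])

print(n_patterns, [len(patterns_by_k[k]) for k in range(1, 7)])
print(round(expected_two, 1))
```

The pattern counts per category are 6, 15, 20, 15, 6, and 1, which sum to 63, and the expected count for each two-genome pattern is 1,727/15 ≈ 115.1.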

Analysis and PCR of specific groups of functional genes

In order to investigate the phylogenetic relationships of the six C. watsonii strains, nucleotide sequences from 25 ORFs were concatenated and aligned to construct a phylogenetic tree and distance matrix. The 25 genes were chosen using the following criteria: (i) They were present in all six strains, (ii) they had some variation between strains (i.e., the vast majority of 100% identical ORF sequences could not be used), and (iii) they had homologs in the two Cyanothece species used as the outgroup (sp. 51142 and CCY0110, which are the two most closely related genomes available, based on 16S rRNA similarity). Because the C. watsonii strains are very closely related, nucleotide sequences were compared, rather than translated amino acid sequences. This allowed the analysis to take into account all possible sequence variation, including synonymous third position changes. The sequence IDs for the original 200 sequences (25 ORFs from each of eight genomes) that were concatenated are listed in Table S1 in the Supporting Information. Eight of the 150 Crocosphaera ORFs were split into two sequences, either because they were on two contigs or by an internal stop codon, which probably arose from sequencing error. For these sequences, the two sequences are listed together in a single cell, in the order in which they were aligned. The sequences were manually concatenated into a single sequence for each genome, and the resulting eight sequences were initially aligned using ClustalX v2.0.11 (Thompson et al. 1994, 1997, Larkin et al. 2007), followed by some manual curation and phylogenetic tree construction in MEGA4 (Tamura et al. 2007). The Neighbor-Joining method, with 1,000 bootstrap replicates, was used to construct the phylogeny (Felsenstein 1985, Saitou and Nei 1987). Evolutionary distances for the tree and distance matrix were calculated based on the same alignment, using the Jukes-Cantor method in MEGA4 (Jukes and Cantor 1969, Tamura et al. 2007). 
For both the phylogeny and distance matrix, all codon positions were included (1st, 2nd, 3rd, and noncoding), and positions containing alignment gaps and missing data were eliminated only in pair-wise sequence comparisons (pair-wise deletion option). There were 22,611 positions in the final dataset.
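For reference, the Jukes-Cantor distance between two aligned nucleotide sequences is d = −(3/4) ln(1 − (4/3)p), where p is the proportion of compared sites that differ. The sketch below applies this formula with pair-wise deletion of gaps and ambiguous sites. It is a stand-alone illustration of the calculation, not a reproduction of MEGA4 internals, and the function name is hypothetical.

```python
from math import log

VALID_BASES = set("ACGT")


def jukes_cantor_distance(seq_a, seq_b):
    """Jukes-Cantor distance for two aligned sequences of equal length.

    Sites where either sequence has a gap or ambiguous base are skipped
    for this pair only (pair-wise deletion).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in VALID_BASES or b not in VALID_BASES:
            continue  # skip gaps and missing data for this pair
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = differences / compared
    if p >= 0.75:
        raise ValueError("distance undefined for p >= 0.75")
    return -0.75 * log(1 - 4 * p / 3)


print(round(jukes_cantor_distance("ACGTACGTAC", "ACGTACGTAA"), 4))
```

With one difference in ten compared sites (p = 0.1), the corrected distance is about 0.107 substitutions per site, slightly larger than the raw proportion because the model accounts for multiple substitutions at the same site.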

Because of prior observations of differences between strains in photosynthetic efficiency (Sohm et al. 2011), per-cell N2-fixation rates (Webb et al. 2009), and phosphorus scavenging genes (Dyhrman and Haley 2006), ORFs with functions related to these processes were compared. All ORFs in each of the six C. watsonii genomes were compared to the Kyoto Encyclopedia of Genes and Genomes (KEGG) database and given KEGG orthology (KO) assignments using the single-directional best hit method to assign orthologs via the web interface of the KEGG Automatic Annotation Server (KAAS, http://www.genome.jp/tools/kaas/; Moriya et al. 2007). Using the KO description and/or the RAST annotated function, genes with roles in the structure or function of the two photosystems, N2-fixation, and phosphorus transport/metabolism were identified. The number of variants of each gene was totaled for each C. watsonii genome using the same criteria used to create the table of 11,635 sequences described above (>95% ID over >70% of the ORF).

The observation of at least one phenotype-specific variant of the isiA gene led to a more detailed analysis of those ORFs. The C. watsonii psbC and isiA genes were used as protein BLAST query sequences against the NCBI nr protein database, and the ten most similar sequences to each were retrieved, followed by removing redundant sequences. The resulting set of protein sequences were aligned along with the C. watsonii ORFs (58 sequences total) using the online Multiple Sequence Comparison by Log-Expectation (MUSCLE, http://www.ebi.ac.uk/Tools/msa/muscle/) tool with default parameters. A phylogenetic tree was generated from the resulting alignment using the UPGMA method with 500 bootstrap replicates in MEGA5 (Sneath and Sokal 1973, Felsenstein 1985, Tamura et al. 2011). Evolutionary distances, as the number of amino acid substitutions per site, were computed using the Poisson correction method (Zuckerkandl and Pauling 1965). All ambiguous positions were removed for each sequence pair, and there were a total of 776 positions in the final dataset. In the WH0402 genome, two of the isiA genes were split into two adjacent ORFs by a stop codon, with adjacent ORFs homologous to adjacent regions of full-length isiA sequences, suggesting that the stop codons may have arisen from sequencing errors. In both cases, the adjacent ORFs were merged prior to alignment, and are so noted on the phylogenetic tree. It should also be noted that WH0401 is not shown in Clade 1 because the ORF was truncated and could not be properly aligned. As such, it is possible the full-length gene is not present in the WH0401 genome, but given the location of the ORF at the end of a contig, it is more likely missing due to the multi-contig, draft status of the genome. 
The psbC clade was identified by examining the sequence alignment for the ~114 amino acid region between the 5th and 6th transmembrane domains of the protein that is known to be absent from IsiA proteins (Laudenbach and Straus 1988, Bricker 1990). The synteny of the genomic regions was examined using BLAST comparisons against WH0005 contig 0012 which was the longest contiguous sequence containing all isiA genes.

Prior to the sequencing of any Crocosphaera genome other than WH8501, fosmid libraries were constructed for multiple C. watsonii strains. The whole genome sequences, which were determined shortly afterward, made detailed analysis of those libraries redundant, so they are not included in this study. However, initial analyses of fosmid end sequences identified a gene in the library from a large-cell strain that was not present in the WH8501 genome. Three different variants of this gene, peptidoglycan-binding LysM:Peptidase M23B (referred to hereafter as lysM), were present in the WH8501 genome, and the fosmid end sequence was a fourth variant with <90% similarity to the other three variants. Based on that finding, a PCR assay was developed which provided support for the genomic comparisons. One reverse PCR primer was designed to a conserved region of the lysM gene (complementary to all four gene variants), and individual forward primers were designed for each variant using Primer 3 (Rozen and Skaletsky 1999). The sequences for all primers and PCR product sizes are given in Table S2 in the Supporting Information. PCRs were carried out in 50 μL volumes, each containing 2 μL of template DNA from one of eight cultures (four large-cell types and four small-cell types). Final reaction concentrations of reagents were as follows: 1× PCR buffer, 2% DMSO, 0.2 mM of each dNTP, 0.4 μM of each primer, and two units of Platinum Taq (Invitrogen, Life Technologies, Grand Island, NY, USA). Reactions underwent an initial heating step of 94°C for 90 s, then 30 cycles of 94°C for 30 s, 56°C for 60 s, and 72°C for 150 s, followed by a final extension step of 70°C for 5 min, and holding at 4°C. PCR products with bands of the expected size on an agarose gel were verified by subcloning and sequencing in the pGEM-T vector system (Promega, Madison, WI, USA) or by direct sequencing following reaction clean-up with QIAquick PCR Purification kits (Qiagen). 
Sanger sequencing reactions and electrophoresis were completed at the UC Berkeley sequencing center using a 3730 DNA Analyzer (Applied Biosystems, Foster City, CA, USA) according to manufacturer's protocols. The PCR results are summarized in Table S3 in the Supporting Information, and for the six strains which now have genome sequence data, corresponding ORF IDs have been added for those with a positive PCR result.


Genome characteristics and sequence duplication within genomes

Genome sequence statistics were tabulated for all six strains, and are summarized in Table 2. The large-cell strains had genome sizes of nearly 6 Mb, while two of the small-cell strains had genomes closer to 4.5 Mb. WH8501 was the exception to this pattern, with the largest genome (6.2 Mb) of the six strains. The total number of ORFs per genome correlated closely with genome size, indicating similar average ORF sizes and similar coding percentages for all strains. The %G+C was nearly identical (37.6%–37.7%) for five of the strains, and the sixth, WH8501, was just below that at 37.1%. All six genomes contained a single rRNA operon, and these operons were nearly identical among the strains. Within the 869 bp region examined in a previous study (Webb et al. 2009), there were four positions with single nucleotide differences among the six genomes. Three of these were identified by Webb et al. (2009) at alignment positions 179, 324, and 794, and the fourth was at alignment position 514. The difference observed at position 514 was a change from an adenine in five of the genomes to a guanine in strain WH8502, a strain that was not included in the Webb et al. (2009) study. The WH8501 genome contained a much larger number of transposase genes than any of the other five strains. Aside from WH8501, the large-cell strains had a slightly higher genomic abundance of transposase genes than the small-cell strains. Two of the small-cell strains (WH8502 and WH0401) had the fewest (~100) strain-specific ORFs, while WH0402 had the most (over 300), and the remaining three strains (WH8501 and two large-cell strains) had ~200 strain-specific ORFs each.

The level of ORF duplication within genomes was assessed by clustering identical sequences into groups of repeated genes. For all genomes, aside from that of WH8501, highly repeated sequences were not found, and most ORFs grouped with only one or two other sequences (Table 3). WH8501 was the only strain with repeat groups of more than 10 sequences; these groups ranged from 14 to 277 copies. Twelve groups had more than 30 copies, and six of these were very large groups of over 100 copies of a sequence in the genome. Most of the repeated sequences (10 of the 12 groups) were annotated as transposase genes, and two (the 124-copy and the 84-copy clusters) had unknown functions, but are likely to be some type of mobile element based on their very high copy number. In addition to having more transposase genes and much more genomic sequence duplication, the WH8501 genome also had a very different transposase composition when genes were assigned to IS families (Figure S2 in the Supporting Information). Transposase genes in the other five genomes mostly fell into the IS200/605 and IS607 families, with smaller numbers observed in the IS4 and IS91 families. In contrast, the most abundant IS families in the WH8501 genome were IS5, IS66, IS630, IS1380, and IS1634. The other five strains contained sequences in these families as well, but in very small numbers.

Table 3. Counts of repeated sequences in each Crocosphaera watsonii genome
Genome | Number of repeats in each group | Number of groups in genome | Total sequences

Shared and unique ORFs among the six genomes

The ORFs of each strain were compared to the other five strains using nucleotide BLAST. If a query sequence was >95% identical over at least 70% of the length of the ORF, it was considered to be present in the reference strain. A high percentage (78%–89%) of ORFs in each large-cell strain was found in the genomes of the other two large-cell strains, while a smaller percentage (62%–70%) was found in the genomes of the three small-cell strains (Fig. 1 and Figure S1). The WH8501 genome shared the most ORFs (79%) with the WH0401 genome, and much less (67%–73%) with the other four strains. In contrast, a large percentage (78%–87%) of ORFs in the genomes of the other two small-cell strains (WH8502 and WH0401) was shared with all five other strains. Reciprocal genome comparisons did not yield the same percentages because of differences in the numbers of ORFs in each genome (Fig. 1 and Figure S1). For example, when the WH8502 ORFs were queried against the WH0003 genome, 80% of them were found. However, only 64% of the WH0003 ORFs were found in the WH8502 genome. This is not surprising given that the WH0003 genome contains 6,145 ORFs, while the WH8502 genome contains only 4,965 ORFs (Table 2).
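The asymmetry between reciprocal comparisons is a matter of denominators. The sketch below back-calculates the approximate number of shared ORFs implied by the reported percentages, then shows how a single shared count yields different percentages for the two query genomes. The implied counts are approximate because the reported percentages are rounded, and the value 3,950 is an invented round number used only for illustration.

```python
# ORF totals from Table 2.
orf_counts = {"WH8502": 4965, "WH0003": 6145}


def percent_shared(n_shared, n_query_orfs):
    """Shared ORFs as a percentage of the ORFs in the query genome."""
    return 100.0 * n_shared / n_query_orfs


# Approximate shared counts implied by the reported (rounded) percentages.
implied_from_wh8502 = round(0.80 * orf_counts["WH8502"])
implied_from_wh0003 = round(0.64 * orf_counts["WH0003"])
print(implied_from_wh8502, implied_from_wh0003)

# One hypothetical shared count, viewed from each query genome.
hypothetical_shared = 3950
for strain, total in orf_counts.items():
    print(strain, round(percent_shared(hypothetical_shared, total), 1))
```

Both directions imply a similar number of shared ORFs (just under 4,000), and the percentage differs mainly because the smaller genome has fewer ORFs in the denominator.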

Figure 1.

Percentage of ORFs shared between Crocosphaera watsonii strains. ORFs for each genome were used as query sequences against the other five genomes in nucleotide BLAST searches. Alignments >95% identity over at least 70% of the ORF were totaled and plotted as a percent of the total number of ORFs in the query genome. Small-cell strains are represented by red shades, and large-cell strains by shades of blue.

Based on the BLAST results of each genome against the other five, a single set of sequences was established that represented all ORFs present in all six genomes. Using the same criteria described above (95% identity over at least 70% of the length of the ORF), the 34,455 ORFs from all six genomes were grouped into 11,635 sequences. The presence or absence of each of those sequences in the six genomes is shown in Figure 2. A total of 3,825 sequences were present in all six genomes, which represented approximately 60% of ORFs in the largest genomes, and up to 80% of the smallest genomes. The nucleotide percent identity for these 3,825 sequences averaged 99.8%. Each genome also contained between ~100 and ~300 sequences that were strain-specific (i.e., absent from all other strains). See Table S4 in the Supporting Information for ORF IDs and functions of all strain-specific ORFs. For sequences found in exactly three strains, the largest category contained sequences that were present in only the three large-cell type strains (781 sequences), and the second largest was sequences found in the three small-cell strains (153 sequences; Fig. 2B). The remaining 909 sequences present in three strains fell into 18 different categories with 10–93 sequences in each (see Table S5 in the Supporting Information). The categories with sequences present in exactly two strains showed a similarly skewed pattern, with the largest numbers of sequences in categories segregated by phenotype (Fig. 2C and Table S5). There were 15 two-strain categories containing a total of 1,727 sequences. Of those, the six phenotype-specific categories contained over 1,250 sequences (751 in only large-cell strains, and 502 in only small-cell strains, or an average of over 200 sequences per category), with only 474 sequences total in the remaining nine categories (average of 53 sequences per category). 
Overall, 3,825 sequences were shared among all six strains, 2,237 sequences were found exclusively in large-cell strains (in one, two, or three genomes), and 1,121 were found only in small-cell strains, with the remaining 4,452 present in at least one strain of each phenotype.
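The four summary groups above can be derived from each sequence's presence/absence pattern. The sketch below shows one way to make that assignment and confirms that the reported group totals add up to the 11,635 non-redundant sequences. The function and label names are illustrative and not taken from the study's software.

```python
SMALL_CELL = {"WH8501", "WH8502", "WH0401"}
LARGE_CELL = {"WH0003", "WH0005", "WH0402"}


def classify_pattern(present_in):
    """Assign a presence pattern (iterable of strain names) to a group."""
    present = set(present_in)
    if not present:
        raise ValueError("a sequence must be present in at least one genome")
    if not present <= SMALL_CELL | LARGE_CELL:
        raise ValueError("unknown strain name in pattern")
    if present == SMALL_CELL | LARGE_CELL:
        return "all six"
    if present <= LARGE_CELL:
        return "large-cell only"
    if present <= SMALL_CELL:
        return "small-cell only"
    return "mixed"  # at least one strain of each phenotype, but not all six


# Group totals as reported in the text.
reported_totals = {
    "all six": 3825,
    "large-cell only": 2237,
    "small-cell only": 1121,
    "mixed": 4452,
}
print(sum(reported_totals.values()))  # 11635
print(classify_pattern(["WH0003", "WH0402"]))
print(classify_pattern(["WH8501", "WH0005"]))
```

Strain-specific sequences fall into one of the two single-phenotype groups under this scheme, which matches the "in one, two, or three genomes" wording above.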

Figure 2.

Presence/absence of all ORFs in six Crocosphaera watsonii genomes. Presence (gray) or absence (black) of 11,635 sequences that represent all ORFs in the genomes of six strains (A). Each strain is represented by the column above the strain names, and each row represents one sequence. Rows are grouped by the number of strains in which the sequence is found, and the total number of sequences in each category is listed on the left. The dendrogram above the columns is based on the presence/absence pattern for all 11,635 rows. Zoomed-in views of the sequences found in three strains (B) and two strains (C) are shown with totals for subcategories listed on the left.

The observed tendency of sequences to segregate by phenotype was tested for statistical significance using the chi-squared goodness-of-fit test. Sequences that were shared between at least two genomes, but not present in all genomes, were included in the analysis. After binning by the number of genomes in which the sequence was found, the observed number of sequences in each pattern-group (i.e., category) was compared to the expected value if all categories were equal, and the differences between the observed and expected values were used in the statistical tests (Fig. 3). In all cases, the deviation from expected values was statistically significant (P < 10^-15). The largest excesses of observed over expected values were generally in categories of a single phenotype, particularly in sequences found in the three large-cell genomes. Among sequences present in four genomes, there were four categories with many more sequences than the expected values (Fig. 3). Two of these were categories where a sequence was present in all of the small-cell genomes plus one large-cell genome, and two categories were equally split between phenotypes (i.e., two large cell and two small cell). The categories in the five-genome bin showed the least deviation from expected values. Among those, the smallest category contained sequences present in the five genomes not including WH0003, which had an observed value well below the expected value. In categories where the observed values were far below the expected values, there was no consistent pattern of genomes or phenotypes that were included or excluded (Fig. 3). A similar statistical analysis was conducted on counts of sequences in categories that were further binned based on strain phenotypes. In that analysis, the expected value for each bin was the expected value for a single category (as described above) multiplied by the number of categories in the bin. 
For sequences found in two or three genomes, all bins that included only a single phenotype had observed values much above the expected values, and in mixed phenotype bins the observed values were well below expected values (Figure S3 in the Supporting Information). As would be expected from differences in observed and expected values, the highest contribution to the chi-squared statistic was from the bin of sequences found only in the three large-cell genomes, followed by sequences found in only two large-cell genomes.
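The expected-count scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the data layout (a dictionary keyed by the set of strains containing a sequence), and the toy counts are hypothetical and are not taken from the study. The function returns the number of categories minus one as the degrees of freedom, which for the 56 possible two- to five-genome patterns gives the Df = 55 reported with Figure 3.

```python
from itertools import combinations

STRAINS = ["WH0003", "WH0005", "WH0402", "WH0401", "WH8501", "WH8502"]


def chi_squared_by_pattern(observed):
    """Goodness-of-fit statistic for presence/absence categories.

    `observed` maps a frozenset of strain names (the genomes containing a
    sequence) to the number of sequences with that pattern. Only patterns
    covering 2 to 5 genomes are used. Within each bin (same number of
    genomes) every category is assumed to be equally likely.
    """
    chi2 = 0.0
    n_categories = 0
    for k in range(2, len(STRAINS)):
        categories = [frozenset(c) for c in combinations(STRAINS, k)]
        bin_total = sum(observed.get(c, 0) for c in categories)
        expected = bin_total / len(categories)
        if expected == 0:
            continue  # empty bin: contributes nothing
        for c in categories:
            chi2 += (observed.get(c, 0) - expected) ** 2 / expected
        n_categories += len(categories)
    return chi2, n_categories - 1


# Hypothetical counts (not from the study): every pattern has 10 sequences,
# except the pattern shared by the three large-cell strains.
toy = {frozenset(c): 10 for k in range(2, 6) for c in combinations(STRAINS, k)}
toy[frozenset(["WH0003", "WH0005", "WH0402"])] = 200
chi2, df = chi_squared_by_pattern(toy)
print(f"chi2 = {chi2:.1f}, df = {df}")
```

In this toy input, almost the entire statistic comes from the single overrepresented large-cell pattern, which mirrors how one phenotype-specific category can dominate the test.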

Figure 3.

Counts of ORFs found in genomes of 2, 3, 4, and 5 Crocosphaera watsonii strains. Observed counts of ORFs for each category, indicated by bars, were compared to the expected count for each category, indicated by black lines. Categories were binned by the number of genomes in which a sequence was present, and expected counts were calculated by assuming all categories were equally likely in each bin. The 6-genome presence/absence pattern for each category is indicated by the boxes below the bars (black = sequence is absent from that genome, colored = present). The chi-squared goodness-of-fit test indicated a statistically significant difference between observed and expected counts for all categories (n = 6640, Df = 55, χ² = 8469, P < 10^−15).

In a previous study of the WH0003 genome, a 25 kb region was identified as likely to be critical to EPS production because it contained a number of EPS-biosynthesis genes and was present in the WH0003 genome, but absent from the WH8501 genome (Bench et al. 2011). Not surprisingly, most of the ORFs in this region were also absent from the other two small-cell strains, and were present in all large-cell strains (Table 4). Furthermore, in the large-cell strains, all of the genes were 100% identical at the nucleotide level over the full lengths of the ORFs.

Table 4. Presence/absence of genes in putative EPS-critical region
WH0003  WH0005  WH0402  WH0401  WH8501  WH8502  WH0003 ORF ID      Annotated function
1       1       1       0       0.94    1       CWATWH0003_3496    Hypothetical protein, similar to glycosyl transferase
1       1       1       0       0       1       CWATWH0003_3497    Short-chain dehydrogenase/reductase SDR
1       1       1       0       0       0.98    CWATWH0003_3498    Hypothetical protein
1       1       1       0       0       1       CWATWH0003_3499    Sugar transferase involved in lipopolysaccharide synthesis
1       1       1       0       0       1       CWATWH0003_3500    Pyruvate dehydrogenase (lipoamide)
1       1       1       0       0       0       CWATWH0003_3501    Pyruvate dehydrogenase (lipoamide)
1       1       1       0       0       0       CWATWH0003_3502    Putative aldo/keto reductase
1       1       1       0       0       0       CWATWH0003_3504    Glycosyl transferase, group 1
1       1       1       0       0       0       CWATWH0003_3505    WblG protein
1       1       1       0       0       0       CWATWH0003_3506    Hypothetical protein
1       1       1       0       0       0       CWATWH0003_3507    O-antigen translocase (like Wzx)
1       1       1       0       0       0       CWATWH0003_3508    DegT/DnrJ/EryC1/StrS aminotransferase family protein
1       1       1       0       0       0       CWATWH0003_3509    Hypothetical protein
1       1       1       0       0       0       CWATWH0003_3510    Hypothetical protein
1       1       1       0       0       0       CWATWH0003_3511    Hypothetical protein
1       1       1       0       0       0       CWATWH0003_3512    Acetyltransferase, putative
1       1       1       0       0       0       CWATWH0003_3513    Oxidoreductase domain protein
1       1       1       0       0       0       CWATWH0003_3514    Putative UDP-N-acetyl-D-mannosamine 6-dehydrogenase
1       1       1       0       0       0       CWATWH0003_3515    Polysaccharide biosynthesis protein CapD
1       1       1       0       0       0       CWATWH0003_3516    Polysaccharide export protein (like Wza)
1       1       1       0       0       0       CWATWH0003_3517    Uncharacterized protein involved in exopolysaccharide biosynthesis (like Wzc)
1       1       1       0       0       0       CWATWH0003_3518    Hypothetical protein
1       1       1       1       0.99    0       CWATWH0003_3519    Animal hem peroxidase
1       1       1       0       0       0       CWATWH0003_3520a1  Transposase, IS200/IS605 family

Phylogenetic analysis and comparison of metabolic capabilities

The evolutionary relationships of the six C. watsonii strains and two closely related Cyanothece species were examined using an alignment of 25 functionally unrelated genes and the corresponding phylogenetic tree and distance matrix. As expected, the two Cyanothece species clustered together as an outgroup to the six Crocosphaera strains (Fig. 4). The six Crocosphaera strains clustered into two subclades with the three large-cell strains in one clade, and the three small-cell strains in the second clade. Over the entire 22 kb alignment, the distances between the Crocosphaera strains were very small (Table S6 in the Supporting Information). Within each of the phenotype subclades, distances ranged from 0 to 0.009 substitutions per site, and between the two clades distances were between 0.024 and 0.028. The distance between Crocosphaera strains and Cyanothece sp. was ~0.16 substitutions per site.
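To make the unit "substitutions per site" concrete, the following minimal Python sketch computes a pairwise distance from two aligned sequences. The substitution model behind the distance matrix is not stated in this section, so the uncorrected p-distance and the Jukes-Cantor correction used here are assumptions chosen for simplicity, and the two short sequences are hypothetical.

```python
from math import log


def p_distance(seq_a, seq_b):
    """Proportion of differing sites, skipping gaps and ambiguous bases."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    valid = set("ACGT")
    compared = differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in valid and b in valid:
            compared += 1
            differing += a != b
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


def jukes_cantor(p):
    """Jukes-Cantor corrected distance in substitutions per site."""
    if p >= 0.75:
        raise ValueError("p-distance too large for the Jukes-Cantor correction")
    return -0.75 * log(1 - 4 * p / 3)


# Hypothetical aligned fragments (not from the study).
strain_1 = "ATGGCTAAAGGTGAACTGCT"
strain_2 = "ATGGCTAAAGGCGAACTGCT"
p = p_distance(strain_1, strain_2)
print(f"p-distance = {p:.3f}, Jukes-Cantor distance = {jukes_cantor(p):.3f}")
```

At distances as small as those reported within each phenotype subclade, the corrected and uncorrected values are nearly identical, so the choice of correction matters mainly for the comparison with Cyanothece.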

Figure 4.

Phylogenetic relationship of six Crocosphaera watsonii strains and two Cyanothece species, based on 25 genes. Evolutionary relationships were inferred from a 25 kb alignment of 25 concatenated genes using the Neighbor-Joining method. The percentage of replicate trees in which the associated taxa clustered together in the bootstrap test (1,000 replicates) is shown next to the branches. The optimal tree, with sum of branch lengths = 0.24091980, is drawn to scale, with branch lengths in units of base substitutions per site.

In addition to the genes coding for EPS-biosynthesis described above, the six C. watsonii genomes were explored for the presence and variants of genes involved in N2-fixation, iron and phosphorus scavenging and metabolism, and photosynthesis. All genes related to N2-fixation that were examined (nifB, D, E, H, K, N, T, U, V, W, X, and Z, and glnB) were present in a single copy in all six genomes. Phosphorus scavenging and metabolism genes were less uniformly distributed among the genomes (Table 5). The pst genes were often present in multiple copies, with a range of copy numbers per genome that did not appear to correlate with phenotype, except for the presence of more pstS copies in the large-cell strains. There was a distinction between the phenotypes in the various forms of alkaline phosphatase, with phoD present in only large-cell strains, and phoA found exclusively in small-cell strains (Table 5). A single gene copy was found in all genomes for most of the other phosphorus-related genes. Finally, the total number of phosphorus-related genes examined was higher (30–32) in the large-cell strains than the small-cell strains (19–25).

Table 5. Counts of phosphorus-related genes in each Crocosphaera watsonii genome
Gene  KEGG Orthology number  Function                                       WH0003  WH0005  WH0402  WH0401  WH8501  WH8502
phnD  K02044                 Phosphonate binding                            2       2       2       1       2       2
phoB  K07657                 Phosphate regulon transcriptional regulator    1       1       1       1       1       1
phoH  K06217                 Phosphate starvation-inducible protein         1       1       1       1       1       1
phoR  K07636                 Phosphate regulon sensor                       1       1       1       1       1       2
phoU  K02039                 Phosphate transport system regulator           1       1       1       1       1       1
pstA  K02038                 Phosphate transport system permease            3       3       3       1       3       1
pstB  K02036                 ATP-binding phosphate transport                3       3       4       3       3       3
pstC  K02037                 Periplasmic phosphate-binding ABC-transporter  2       2       2       1       1       2
pstS  K02040                 High-affinity phosphate-binding                6       5       7       3       3       4
phoD  K01113                 Phosphodiesterase/alkaline phosphatase D       2       2       2       0       0       0
phoA  -                      Alkaline phosphatase                           0       0       0       1       1       1
-     -                      Alkaline phosphatase (non-phoD & non-phoA)     2       4       2       0       0       1
dedA  -                      Alkaline phosphatase-like                      1       1       1       1       1       1
pitA  K03306                 Inorganic phosphate transporter                1       1       1       1       1       1
ppa   K01507                 Inorganic pyrophosphatase                      2       1       2       1       1       2
ppk   K00937                 Polyphosphate kinase                           1       1       1       1       1       1
ppx   K01524                 Exopolyphosphatase                             1       1       1       1       1       1
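The per-genome totals quoted in the text (30 to 32 phosphorus-related genes in large-cell strains versus 19 to 25 in small-cell strains) can be reproduced by summing the columns of Table 5. The short Python sketch below does that; the copy numbers are transcribed from the table, while the variable names and the label used for the unnamed alkaline phosphatase row are illustrative choices, not part of the original analysis.

```python
STRAINS = ["WH0003", "WH0005", "WH0402", "WH0401", "WH8501", "WH8502"]
LARGE_CELL = STRAINS[:3]
SMALL_CELL = STRAINS[3:]

# Copy numbers per genome, in the column order of Table 5.
COUNTS = {
    "phnD": (2, 2, 2, 1, 2, 2),
    "phoB": (1, 1, 1, 1, 1, 1),
    "phoH": (1, 1, 1, 1, 1, 1),
    "phoR": (1, 1, 1, 1, 1, 2),
    "phoU": (1, 1, 1, 1, 1, 1),
    "pstA": (3, 3, 3, 1, 3, 1),
    "pstB": (3, 3, 4, 3, 3, 3),
    "pstC": (2, 2, 2, 1, 1, 2),
    "pstS": (6, 5, 7, 3, 3, 4),
    "phoD": (2, 2, 2, 0, 0, 0),
    "phoA": (0, 0, 0, 1, 1, 1),
    "other alkaline phosphatase": (2, 4, 2, 0, 0, 1),  # unnamed row in Table 5
    "dedA": (1, 1, 1, 1, 1, 1),
    "pitA": (1, 1, 1, 1, 1, 1),
    "ppa": (2, 1, 2, 1, 1, 2),
    "ppk": (1, 1, 1, 1, 1, 1),
    "ppx": (1, 1, 1, 1, 1, 1),
}

# Sum each column to get the total gene count per genome.
totals = {
    strain: sum(row[i] for row in COUNTS.values())
    for i, strain in enumerate(STRAINS)
}

for group, members in (("large-cell", LARGE_CELL), ("small-cell", SMALL_CELL)):
    values = [totals[s] for s in members]
    print(f"{group}: {min(values)} to {max(values)} genes per genome")
```

The same layout makes it easy to check individual claims, for example that pstS is the gene with the largest difference in copy number between the two phenotypes.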

Among iron-related genes, many did not vary in copy number among genomes (e.g., fur, tonB, and exbB/D), while others showed different patterns of copy numbers among strains (Table S7 in the Supporting Information). Some were variable, but copy numbers did not correlate with phenotype or strain origins, while others had higher copy numbers in the large-cell strains, including isiA and feoB. Similar to the phosphorus-related genes, but with a smaller difference, the total number of iron-related genes was higher in large-cell strains (29–33) than in small-cell strains (25–28). Photosystem genes showed less variability among strains than phosphorus- and iron-related genes. With very few exceptions, photosystem I (PSI) genes were present as a single copy in all six genomes, although a few ORFs were split by a stop codon (likely a sequencing error) or across two contigs (Table S8 in the Supporting Information). Similarly, most photosystem II (PSII) genes were also present in a single copy in the six genomes. The exceptions were psb28, which was present in two variants in all six genomes; psbD, for which a second variant was found in two genomes; and psbA, for which the number of variants differed among genomes. A sequence and alignment-based examination of the psbA ORFs revealed the presence of two isoforms in the genomes. Most of the Crocosphaera predicted PsbA proteins were the D1:1 isoform, based on the presence of a glutamine in the D1:1/D1:2 determinant position (Garczarek et al. 2008). There was one complete copy of the D1:1 gene in the WH8501 strain, two complete copies in WH8502, one complete and one partial copy in WH0005, two partial copies that appear to be one gene (based on start/stop) in WH0401, and multiple partial genes in the WH0003 genome. 
Because of the multi-contig nature of the genome and the observation that most partial ORF sequences are at the ends of contigs, it is difficult to know how many full-length genes are present in the WH0003 genome. The D1:1 isoform was not found in the WH0402 genome; however, all six genomes contained exactly one copy of a more divergent isoform with a leucine at the D1:1/D1:2 determinant position along with a variety of other amino acid changes throughout the protein.

Phylogenetic analysis of the C. watsonii isiA genes and a closely related psbC gene revealed four variants of isiA in the C. watsonii genomes, three of which correlated strongly with phenotype. The psbC gene and one of the isiA genes (Clade 1 in Figure S4 in the Supporting Information) were present in all six strains, and both of those genes were most similar to those of closely related cyanobacteria (e.g., Cyanothece). The other three isiA variants (Clades 3, 4, and 5 in Figure S4) were most similar to Trichodesmium erythraeum sequences, one of which (Clade 4) had no homologs in any other organism. The sequences in Clade 5a were divergent enough from the 5b sequences that they did not form a single clade in the phylogenetic tree. However, manual inspection of the sequence alignment revealed that sequences in both groups were substantially longer than ORFs in the other four clades due to the presence of a PsaL domain at the C-terminus of the amino acid sequence. As such, they were labeled with the same clade number to indicate their apparent functional similarity. In addition, the three Trichodesmium-like isiA variants were found almost exclusively in the large-cell strains, and were immediately adjacent to each other in those genomes, with a flavodoxin gene immediately upstream (Fig. 5). Aligning the C. watsonii genomic contigs with the T. erythraeum genome illustrated that synteny for the three isiA genes was conserved, but the T. erythraeum flavodoxin gene was in a distant genomic region. Additionally, the genomes of the small-cell strains contained only fragments of this region, and in the WH8501 genome, these fragments were located on much larger contigs adjacent to sequences that were not similar to the aligned region (Fig. 5).

Figure 5.

Illustration of alignment of genome regions containing iron-related genes. Genomic regions were identified and aligned using BLAST comparisons to the longest contiguous sequence containing all isiA genes (WH0005 contig0012). Ends of contigs are shown by straight edges, and wavy edges indicate the contig continues but without similarity to the aligned region shown. If no contig is shown for a portion of the alignment, there are no similar sequences to that region anywhere in the genome of that strain. Contig IDs are listed in white text for all Crocosphaera species and the GenBank locus tags are shown in gray text for the Trichodesmium genomic regions.


Broad genome observations and transposase abundances

The C. watsonii WH8501 genome appears to be unusual among this group of Crocosphaera genomes in a number of respects, including genome size and transposase abundance (Table 2), IS family distribution (Figure S2), and repeated genomic sequences (Table 3). There is a possibility that some differences stem from the fact that the genomic DNA preparation and sequencing methods for WH8501 were different from the other five strains (i.e., no cell sorting, and the use of Sanger sequencing rather than pyrosequencing). However, it is not clear how such methodological differences would have led to the observed genomic differences. PCR assays for 16 separate loci have been designed from the WH8501 genomic sequence, and none has produced unexpected results. Four of those loci are described in this study (Tables S2 and S3), four were developed for another project by the authors of this study (data not shown), three were described in Dyhrman and Haley (2006), and five were described in Zehr et al. (2007). Furthermore, a whole genome microarray was designed based on the WH8501 genome sequence, and subsequent experiments using cultured cells have not shown any systematic problems that might call into question the validity of the genome sequence (e.g., Shi et al. 2010). Transposase content (total length of 1,211 transposase ORFs = 919,337 bp) accounts for most of the extra ~1.5 Mb in the WH8501 genome compared to the other two small-cell strains. Of the six draft genomes, the WH0402 genome has the largest number of contigs and smallest average contig length and has the highest ratio of ORFs to genome size. As such, ORFs were more likely to be split between two contigs, and if both fragments were annotated, the gene function would appear duplicated, with a “copy” on each contig. 
This was seen in the genome counts of iron-related and photosystem genes (Tables S7 and S8), where WH0402 had multiple genes that were separated into two ORFs, both shorter than expected, and annotated with the same function. This type of ORF-splitting may partly explain why the WH0402 genome has the highest number and percentage of strain-specific ORFs (Table 2).
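As a quick check of the statement that transposase content accounts for most of the extra sequence in WH8501, the figures given above can be combined directly. The sketch below uses only numbers quoted in the text; because the "~1.5 Mb" figure is itself approximate, the resulting share should be read as approximate too, and the variable names are illustrative.

```python
TRANSPOSASE_ORFS = 1211        # transposase ORFs in the WH8501 genome
TRANSPOSASE_BP = 919_337       # combined length of those ORFs, in bp
EXTRA_GENOME_BP = 1_500_000    # approximate extra sequence ("~1.5 Mb")

mean_orf_length = TRANSPOSASE_BP / TRANSPOSASE_ORFS
fraction_of_extra = TRANSPOSASE_BP / EXTRA_GENOME_BP

print(f"mean transposase ORF length: {mean_orf_length:.0f} bp")   # 759 bp
print(f"share of the extra ~1.5 Mb: {fraction_of_extra:.0%}")     # 61%
```

Roughly three-fifths of the size difference is therefore attributable to transposase ORFs alone, before counting any associated non-coding sequence.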

The extremely large numbers of transposase genes in the WH8501 genome are mostly shared with at least one other strain, with only 71 being strain-specific (Table 2). It is not clear why the high level of gene duplication observed in WH8501 transposase genes was not observed in any of the other genomes. The relative abundances of IS families among the WH8501 strain-specific transposases are not proportional to abundances in the whole genome (Figure S2). Three of the four IS families that are most abundant in the non-WH8501 strains have the unusual property of lacking associated inverted repeat sequences, while the most abundant families in WH8501 are more typical insertion sequences with associated inverted repeats (Siguier et al. 2006). The lack of inverted repeats may partly explain why the insertion sequences in five of the genomes are not highly replicated. IS elements are known to confer adaptive advantages such as acquisition of new metabolic capabilities and increased genomic plasticity via genomic insertions, deletions, and homologous recombination between multi-copy elements (Chandler and Mahillon 2002, Lysnyansky et al. 2009). As such, the larger number of IS elements in the WH8501 genome, particularly those present in many identical copies, could provide that strain with an increased ability to adapt to environmental changes. Future work could test this idea through competitive growth experiments of multiple strains conducted under changing physical and/or chemical conditions.

Sequence conservation and shared versus specific genes among genomes

The very high nucleotide percent identity (99.8%) for sequences shared among all six Crocosphaera strains illustrates that there has been very little mutation accumulation since strain divergence. This level of identity is much higher than what has been observed in sympatric cyanobacteria, even among very closely related strains (Rocap et al. 2003, Rusch et al. 2007, Dufresne et al. 2008, Scanlan et al. 2009). There are a few single nucleotide differences among C. watsonii 16S rRNA sequences, yet their genomic nucleotide identity is higher than the ~97% average genomic nucleotide identity in Baltic Sea heterotrophic Shewanella species with identical 16S rRNA sequences (Caro-Quintero et al. 2011). The closest example to the high level of identity in Crocosphaera is a study of two populations of marine Vibrio species with >99% average identity in amino acid sequences (Shapiro et al. 2012). However, those Vibrio populations were recently diverged, and also had an entire second genomic chromosome that differed between populations. Species from non-marine environments also appear to accumulate sequence mutations more rapidly than Crocosphaera. For example, Leptospirillum that evolved in an acid mine drainage system over a period of only 9 years accumulated genome-wide sequence differences of 6% between strains (Denef and Banfield 2012). The mechanism by which Crocosphaera prevents the accumulation of DNA mutations remains unclear. No notable genes or unusual sequences were found during examinations of known DNA replication and repair proteins in the genomes of the six strains examined in this study. Future work is clearly needed to address this question. Directed mutation studies would be ideal to help identify relevant genes, but no strains of Crocosphaera are known to be genetically transformable. 
A less efficient strategy could be to use chemical mutagens to generate strains with higher mutation rates, whose genomes could then be examined for altered or missing genes related to DNA replication or repair.

Despite sequence conservation in shared sequences, genome size and reciprocal genomic comparisons show that the larger genomes of the large-cell strains contain functions that are missing from the small-cell strains (Table 2; Fig. 1 and Figure S1). Some evidence suggests that the large-cell specific functions may have been lost from the small-cell strains since divergence (Fig. 5). However, it is also possible that some functions have been acquired after phenotypic divergence. There are roughly 1,000 more ORFs in the large-cell genomes than in two of the small-cell strains, and the large-cell genomes are more similar to each other than to the small-cell genomes (Figs. 1 and 4). Because the large-cell specific ORFs have no similarity to sequences in the small-cell strains, gene duplication cannot explain their larger genome sizes. Furthermore, replicated sequences could partly explain the larger genome of WH8501, but cannot explain the larger genomes of the large-cell strains that do not have genomic sequence duplication (Table 3).

The presence/absence patterns of the ORFs in the six genomes further show that the large-cell strains harbor functions that are missing from the other three strains. There were nearly 800 sequences shared among all three large-cell strains, and absent from all small-cell strains (Fig. 2B). Of those sequences, 65% (501) were hypothetical or unknown, and only 3% (25) were transposases. The remaining 246 sequences were annotated with a wide variety of functions, including a number with functions related to DNA metabolism and modification, such as single-stranded DNA binding proteins, and DNA polymerases and primases (functions listed in Table S9 in the Supporting Information). In addition, large-cell-specific sequences also include the EPS-biosynthesis pathway genes identified in the previous genome comparison of two strains (Bench et al. 2011). In fact, 17 of the 23 genes identified in a large deletion from the WH8501 genome are also absent from the other two small-cell strains, but present in all three strains characterized by abundant EPS production (Table 4).

Genomic and metabolic differences between phenotypes

The dendrogram based on the presence/absence patterns of all possible 11,635 sequences (top of Fig. 2A), and the fraction of shared genes between strains (Fig. 1) both demonstrate that the C. watsonii strains which share a phenotype are more closely related than those that share a similar cultivation origin (ocean basin, or year of isolation, see Table 1). The three large-cell strains share many more ORFs among their group than they do with any of the small-cell strains (Fig. 1 and Figure S1). In addition, the categories with sequences found exclusively in large-cell strains were the most overrepresented and contributed most to the statistical significance of differences between observed and expected counts of sequences (Fig. 3 and Figure S3), even though those strains were isolated from different ocean basins (Table 1). Similarly, the three small-cell strains cluster together, yet WH0401 was isolated from the North Atlantic 20 years after WH8501 and WH8502 were isolated from the South Atlantic. The clustering of the two strains isolated in 2000 (WH0003 and WH0005), and of the two isolated in 1984 (WH8501 and WH8502), is likely a result of their shared phenotype, rather than shared isolation history. While additional genomic sequence for a co-isolated strain of the opposite phenotype would be required to completely verify that theory, a PCR-based analysis of four genes showed consistent genetic differences between phenotypes even in co-isolated strains. Specifically, a small-cell strain (WH0004) isolated in the same year, season, and region as WH0003 and WH0005 (two large-cell strains) showed the expected pattern for small-cell strains for the LysM genes (Table S3). Further evidence of phenotypic clustering is provided by the phylogenetic tree in Figure 4, where C. watsonii strains cluster by phenotype with 100% support, and by the resulting distance matrix in Table S6, which shows almost zero evolutionary distance among strains within each phenotype, and larger distances between strains of opposite phenotypes.

While many genes and functions were redundant within the genomes of the six strains, no redundancy was observed in genes critical for N2-fixation or most of the photosystem genes (Table S8). In contrast, a number of iron- and phosphorus-related genes were present in multiple copies in the genomes, with some copies specific to a phenotype (Table 5 and Table S7). For example, the observation that large-cell genomes contained more copies of genes in the high-affinity phosphate transport system (i.e., the pstSCAB operon, which is known to be upregulated during phosphorus starvation (Dyhrman and Haley 2006)) indicates that large-cell strains may have larger phosphorus requirements and/or may be better adapted to low phosphate conditions. It is also interesting that the genomes of the two phenotypes contain different variants of alkaline phosphatase (Table 5) which may differentiate the phosphorylated substrates utilized by each phenotype. The higher number of iron metabolism genes in the large-cell strains may signify that those strains also are more capable of thriving under low iron conditions. Studies that have directly examined the response of cultivated C. watsonii to changes in Fe and P have observed dramatic diel recycling of iron metalloproteins (e.g., photosynthesis and N2-fixation proteins) as well as changes in growth, gene expression, and nitrogen fixation rates (Webb et al. 2001, Tuit et al. 2004, Falcon et al. 2005, Dyhrman and Haley 2006, Fu et al. 2008, Compaoré and Stal 2010, Shi et al. 2010, Saito et al. 2011). However, those studies were carried out almost exclusively on the C. watsonii WH8501 strain, so future experiments with additional strains will be needed to verify whether the phenotypes are adapted differently to low or changing nutrient levels. In addition, metatranscriptomic or metaproteomic studies could be used to examine the expression of phenotype-specific genes in natural populations.

Evidence of genome evolution in photosynthesis gene clusters

Phylogenetic analysis of isiA and psbC genes produced three distinct groups of sequences. The psbC sequences formed a clade (Figure S4, Clade 2 in gray box), with 100% bootstrap support and shorter branch lengths than the isiA clades, indicating less sequence divergence in psbC. This is not surprising because PsbC, a chlorophyll binding protein, is a critical component of PSII under considerable selective pressure (Chisholm and Williams 1988, Ananyev et al. 2005). The iron starvation-induced chlorophyll binding protein (IsiA), a closely related homolog of PsbC, has at least three known functions: (i) chlorophyll storage during iron-limited conditions, (ii) dissipation of light-excitation energy, and (iii) service as a light-harvesting antenna for PSI (Sandström et al. 2001, Singh and Sherman 2007, Chauhan et al. 2011). There are intriguing differences in the presence of isiA genes in the six C. watsonii genomes, with only one of the isiA variants being found in all six strains (Clade 1 in Figure S4). All of the C. watsonii (and Trichodesmium IMS101) isiA genes in Clade 1, including the truncated WH0401 sequence, are adjacent to a downstream flavodoxin (fldA or isiB) ORF that is identical among the six strains (100% nucleotide identity). Flavodoxin is an iron-free replacement for ferredoxin, the iron–sulfur electron transfer protein important in N2-fixation and CO2 fixation, and as in the C. watsonii genomes, isiB is commonly found in a single operon with isiA (Singh and Sherman 2007, Chauhan et al. 2011). As such, this cluster of genes is likely to be important for Crocosphaera during periods of iron limitation, which are common in oligotrophic habitats.

The isiA genes in Clades 3, 4, and 5 appear to have a different evolutionary history than the Clade 1 genes (Figure S4). This is illustrated by the relatively long branch length between Clade 1 and the rest of the isiA genes, as well as the observation that only Clade 1 isiA ORFs are found in the species most closely related to C. watsonii by nifH and 16S rRNA phylogenies (e.g., Cyanothece spp.), while the other three variants are only found in Trichodesmium and more distantly related cyanobacteria. Furthermore, the Clade 1 isiA genes were found in a different genomic location than the three other variants (referred to hereafter using the clade designations from Figures 5 and S4; C3, C4, and C5) which are directly adjacent to each other in the genomes of the three large-cell strains and immediately downstream (Fig. 5) of a second flavodoxin gene (distinct from the isiB genes found adjacent to the Clade 1 isiA genes). Additionally, despite amino acid sequence divergence of over 20% between the two species, Trichodesmium has conserved synteny for the three adjacent isiA ORFs, illustrating that gene order has been maintained since they diverged from their last common ancestor. The genome of the more distantly related cyanobacterium Anabaena PCC7120 also contains two adjacent genes (locus tags all4002 and all4003) with similarity to two of the adjacent Crocosphaera isiA-like ORFs (C5 and C3, respectively). Synteny does not extend beyond those two genes, but it is interesting to note that in PCC7120, these ORFs are part of a cluster of four adjacent isiA-like ORFs, although neither of the other two ORFs is more similar to the Crocosphaera C4 isiA ORFs. The PCC7120 genome also contains an ORF similar to the Crocosphaera isiB ORF shown in Figure 5, but that ORF is located in a separate genomic location, as was observed in the Trichodesmium genome. 
Given the apparent ancient evolutionary origin of the Crocosphaera isiA gene cluster, it is surprising that the small-cell strains are missing most of these genes. However, the pattern of remaining genomic fragments in the small-cell genomes (Fig. 5) demonstrates a generalized loss of genetic material suggested by the smaller genome sizes of the small-cell strains and by the significant number of genes shared among large-cell strains, but absent from small-cell strains. Evidence for a mechanism leading to such genetic loss is provided by the WH8501 genome, where the isiA C4 gene is present, but the isiA C3 ORF is truncated by the presence of a transposase gene and the resulting contig sequence continues for some length without any sequence similarity to isiA C3 or isiA C5 (Fig. 5). This is the type of gene loss that would result from genetic rearrangements, which would be expected in a genome containing abundant transposase genes. Finally, because IsiA plays an important role in photosynthesis during iron-limited conditions (Chauhan et al. 2011), it is possible that the additional copies of the isiA gene in the genomes of the large-cell strains could make them better able to continue photosynthesis in low iron environments. This may also explain the observations of higher photosynthetic efficiency (Fv/Fm) in the large-cell phenotype strains (Sohm et al. 2011).

The apparent genome degradation observed in the small-cell Crocosphaera strains relative to the large-cell strains provides insight into the evolutionary history of the species. Bacterial genomes are known to degrade over time, with non-critical genes more likely to be lost. An example of this is the genome reduction observed in pathogenic bacteria during adaptation to host environments (Rau et al. 2012). Further, in small effective populations, such as pathogens recently adapted to a host, there is often a higher incidence of mobile genetic elements and a higher likelihood that deleterious mutations in important genes will become fixed within the population (Ochman and Davalos 2006). Evidence for these evolutionary forces is apparent in the Crocosphaera genomes, where critical genes, such as the nif operon, are present with 100% sequence identity in all of the strains, while regions containing redundant genes, such as the isiA genes (Fig. 5), show evidence of degradation, partly through the activity of mobile genetic elements. A number of studies have demonstrated that very closely related bacterial species can diverge from each other and adapt to their respective environments by acquiring a relatively small number of specific genetic capabilities that provide a metabolic advantage, and such adaptation is often facilitated by a high level of lateral gene transfer. This process has been observed in a number of species across different habitats, including marine Vibrio species (Shapiro et al. 2012), Escherichia coli from soil and freshwater (Luo et al. 2011), Shewanella from the Baltic Sea (Caro-Quintero et al. 2011), and Leptospirillum in an acid mine drainage system (Denef and Banfield 2012). In the marine environment, there is evidence that the most abundant cyanobacterial genera, Prochlorococcus and Synechococcus, have evolved into distinct ecotypes through gain and loss of hypervariable genomic islands that appear to be laterally transferred. 
Most genes in those islands have unknown functions, but others have clearly adaptive functions including cell wall proteins (affecting grazing), DNA mobility, and genes that are differentially expressed during light stress or nutrient limitation (Coleman et al. 2006, Dufresne et al. 2008, Scanlan et al. 2009). Furthermore, it has been proposed that the higher abundance of genetic regulatory systems in coastal Synechococcus strains (versus open ocean strains) makes them better adapted to deal with the environmental fluctuations that occur more commonly in coastal ecosystems (Dufresne et al. 2008). As such, the genomic differences described in this study make it likely that the two Crocosphaera phenotypes are adapted to different marine environments or niches. Although the strains have been isolated from marine habitats with similar chemical and physical characteristics, there may be subtle environmental differences that have yet to be identified.


The vast majority of genes in each of the six Crocosphaera genomes were shared with at least one other strain, many with multiple strains, and a large fraction were shared among all six strains at >99% nucleotide identity, which was not surprising in light of previous studies that have found a high degree of genetic sequence conservation in the species. The genome of WH8501, which has been the type-strain for the species for decades, was found to be surprisingly unusual within the small-cell phenotype and among this group of isolates in a number of respects, such as a larger genome, much more abundant transposase genes, and much higher levels of gene duplication. This calls into question whether WH8501 should continue to be used as the type-strain for the species in future studies. The various genomic and statistical analyses described here show that C. watsonii strains with the same phenotype cluster together, while similar clustering was not observed in strains with temporal or spatial proximity of isolation. Despite substantial genetic similarity among the genomes of the six strains, the strain-specific and phenotype-specific genes identified in this comparison apparently provide enough differences to result in phenotypic divergence. The resulting phenotypes are genetically characterized by smaller genomes and apparent gene loss in the small-cell strains, and by larger genomes and more redundancy in genetic and metabolic capabilities in the large-cell strains. Finally, there is some evidence that among the redundant genes are capabilities which may make the large-cell strains better adapted to iron- and phosphorus-limited environments. The genome sequences analyzed in this study provide important data that can be applied in future studies to test such hypotheses in isolated Crocosphaera strains as well as natural populations.

The authors acknowledge funding from NSF grant EF0424599 for the Center for Microbial Oceanography: Research and Education (C-MORE) and from the Gordon and Betty Moore Foundation Marine Microbiology Initiative (to J. P. Zehr). We are also grateful to John Waterbury for providing Crocosphaera isolates and to Brandon Carter and the MEGAMER facility for assistance with flow cytometry and cell sorting. We thank Eric Webb and Jack Meeks for insight and discussion, and Mary Hogan, Kendra Turk, Jim Tripp, Jason Hilton, Julie Robidart, Anne Thompson, and Deniz Bombar for technical and scientific input. Finally, we thank two anonymous reviewers for their astute observations and suggestions that helped improve the manuscript.