Strain growth and genomic DNA isolation and sequencing
The phenotypes, isolation locations, and GenBank genome accession numbers for the C. watsonii strains described in this study are listed in Table 1; all strains have been described previously (Waterbury et al. 1986, 1988, Webb et al. 2009). All strains were grown in nitrogen-free SO medium (Waterbury et al. 1986, 1988) in polycarbonate tissue culture flasks with a 0.2 μm pore-size vent cap (Corning Inc., Corning, NY, USA) at 26°C under a 12:12 h light/dark cycle. The genome of the WH8501 strain was sequenced by the Joint Genome Institute, and the resulting publicly available sequence (accession number in Table 1) was used for comparisons in this study. The genome of strain WH0003 was sequenced prior to this study, with detailed methods described in Bench et al. (2011). Briefly, a non-axenic culture was subjected to bead-beating to detach cells from their extracellular matrix (ECM), and the cells were subsequently sorted using fluorescence-activated cell sorting. The genomic DNA from the sorted cells was amplified using the GenomiPhi V2 amplification kit (Amersham Biosciences, Piscataway, NJ, USA).
Table 1. Crocosphaera watsonii phenotypes and strain origins
|Strain||Phenotype||Year isolated||Ocean basin where isolated||Location where isolated||Genome accessions|
|WH8501||Small-cell||1984||S. Atlantic||28°S, 48°W||AADV00000000.2|
|WH8502||Small-cell||1984||S. Atlantic||26°S, 42°W||CAQK01000001 - CAQK01000869|
|WH0401||Small-cell||2002||N. equatorial Atlantic||6°N, 49°W||CAQM01000001 - CAQM01000918|
|WH0003||Large-cell||2000||N. Pacific (St. ALOHA)||22°N, 158°W||AESD01000001 - AESD01001126|
|WH0005||Large-cell||2000||N. Pacific (St. ALOHA)||22°N, 158°W||CAQL01000001 - CAQL01001266|
|WH0402||Large-cell||2002||S. equatorial Atlantic||11°S, 32°W||CAQN01000001 - CAQN01001343|
Genomic DNA for the four additional strains described in this study was obtained using the same methods as for the WH0003 strain (Bench et al. 2011), with the following modifications: the WH8502 and WH0401 strains do not produce large amounts of ECM, so they were sorted without bead-beating, and the REPLI-g Midi kit (Qiagen, Valencia, CA, USA) was used instead of GenomiPhi to amplify DNA from all four strains. The REPLI-g amplification method was based on the protocol provided by Qiagen for “small numbers of cells or single cells” (http://www.qiagen.com/products/genomicdnastabilizationpurification/replig/repligminimidikits.aspx#Tabs=t2). Specifically, sorted cells (5,000–10,000 cells for each strain) were spun at 14,000 rpm (20,800 × g) for 40 min and the supernatant was discarded. Pelleted cells were resuspended in 2.5 μL of PBS followed by 3.5 μL of Buffer D2 (see Qiagen protocol above), and lysed in a 65°C water bath for 5 min. The lysed cells were placed on ice, and lysis was terminated by adding 3.5 μL of Stop Solution (provided with the kit). Amplification was immediately carried out in 50 μL reactions, which contained the cell-lysis mix plus 1 μL of REPLI-g Midi DNA Polymerase, 29 μL of REPLI-g Midi reaction buffer, and 10 μL of RT-PCR grade H2O.
Prior to 454 sequencing, amplified genomic DNA was quantified using PicoGreen (Invitrogen Corporation, Carlsbad, CA, USA). Using the amplified DNA from sorted cells, shotgun libraries for each strain were constructed and sequenced by the UCSC Genome Sequencing Center (http://biomedical.ucsc.edu/GenomeSequencing.html) on the Genome Sequencer FLX instrument using Titanium Series protocols according to the manufacturer's specifications (454 Life Sciences, Branford, CT, USA).
Genome assembly and annotation
For the four strains sequenced during this study, an average of 363,200 reads were generated per genome, with an average read length of 374 bp and ~135,900 kb of sequence data for each genome. The genome-wide average depth of coverage ranged from 23× to 30× for each strain, and while some variation in read depth was noted among contigs, the vast majority (86%–98%) of contigs in each genome had an average coverage of over 10 reads deep. The reads for each strain were assembled separately using Version 2.0.00 of the Newbler GS De Novo Assembler program (454 Life Sciences). The assembly was run via command line interface using the “-nrm”, “-consed”, and “-large” flags. All other parameters used were the default values, as described in the manufacturer's publication, “Genome Sequencer Data Analysis Software Manual”.
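As a rough consistency check on the figures above, the expected average depth of coverage can be estimated as total sequenced bases divided by genome length. The sketch below uses the average read count and read length reported in the text. The genome length is a placeholder assumption (it is not taken from Table 2), and the function name is hypothetical.

```python
def mean_depth_of_coverage(n_reads: int, mean_read_len_bp: float, genome_len_bp: int) -> float:
    """Expected average depth = total sequenced bases / genome length."""
    if genome_len_bp <= 0:
        raise ValueError("genome_len_bp must be positive")
    return n_reads * mean_read_len_bp / genome_len_bp


# Averages reported in the text for the four newly sequenced strains.
AVG_READS = 363_200
AVG_READ_LEN_BP = 374

# Product of the two averages; this need not match the reported total exactly,
# because an average of products differs from a product of averages.
total_kb = AVG_READS * AVG_READ_LEN_BP / 1_000

# ASSUMPTION: 5.9 Mb is a placeholder genome length, not a value from this study.
ASSUMED_GENOME_LEN_BP = 5_900_000

depth = mean_depth_of_coverage(AVG_READS, AVG_READ_LEN_BP, ASSUMED_GENOME_LEN_BP)
print(f"{total_kb:,.0f} kb of sequence, about {depth:.1f}x average depth")
```

With the assumed genome length, the estimate falls at the low end of the reported 23× to 30× range. Smaller assemblies would give proportionally higher depth.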
Open reading frames (ORFs) in each of the contig sequences for all four draft genomes were identified and annotated using RAST (Aziz et al. 2008). A small number of contigs (two contigs, ~1.3 kb, from the WH0401 genome, and 18 contigs, ~14 kb, from the WH0005 genome) were removed from further analysis based on a lack of recognizable coding sequence and/or a lack of any homology to known cyanobacterial sequences. It should be noted that ORF identification and annotation are unavoidably difficult in multi-contig genomes. As such, there may be a small fraction of coding sequences in these genomes that were not properly identified, particularly at the ends of contigs. In addition to annotated ORFs, each genome contained a single rRNA operon and 39 tRNAs. The final number of bases and contigs in each genome, as well as the %G+C and number of annotated ORFs, are listed in Table 2. The genome sequences and annotations deposited at DDBJ/EMBL/GenBank are publicly available at http://www.ncbi.nlm.nih.gov/ using the accession numbers listed in Table 1.
Table 2. Genome sizes and gene content statistics for six Crocosphaera watsonii strains
| ||Strain||Total genome length (bp)||Average % G+C||Number of ORFs||Number of transposase genes||Number (and %) of strain-specific ORFs^a||Number of strain-specific transposase genes||Number of contigs|
|Small-cell strains||WH8501||6,238,156||37.1||5,958||1,211||229 (3.8%)||71||320|
|Large-cell strains||WH0003||5,892,658||37.7||6,145||223||223 (3.6%)||9||1,130|
Transposase genes were identified and assigned to IS families using the BLAST tool on the ISfinder website (http://www-is.biotoul.fr/is.html) with default parameters (Siguier et al. 2006). ORFs with protein BLAST (BLASTp) E-values of <10⁻³ were annotated as transposases and assigned to the IS family of the top BLAST hit. In addition, a small number of ORFs (10–18 per genome) were annotated with the transposase function by RAST but did not have qualifying BLAST hits to the ISfinder database. These were included in all of the transposase analyses, such as genome counts and IS family tallies. ORFs without a qualifying ISfinder hit and with a RAST annotation lacking IS family information were listed as “unknown” in the IS family counts. A similar process was used to identify transposase genes in WH0003 and WH8501, with one additional pre-analysis step for the WH8501 genome, which involved identification and grouping of highly duplicated sequences (see methods in Bench et al. 2011).
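The assignment rule described above can be sketched as follows. This is a simplified illustration, not the pipeline used in the study. The tuple layout, function name, and example records are hypothetical. "Top BLAST hit" is assumed here to mean the hit with the lowest E-value, and every ORF annotated by RAST without a qualifying hit is simply labeled "unknown".

```python
from collections import Counter

E_VALUE_CUTOFF = 1e-3  # threshold stated in the text


def assign_is_families(blast_hits, rast_transposases):
    """Map each transposase ORF to an IS family.

    blast_hits: iterable of (orf_id, is_family, evalue) tuples (hypothetical layout).
    rast_transposases: set of ORF ids annotated as transposases by RAST.
    """
    best = {}  # orf_id -> (evalue, family) of the best qualifying hit
    for orf_id, family, evalue in blast_hits:
        if evalue >= E_VALUE_CUTOFF:
            continue  # hit does not qualify
        # ASSUMPTION: the "top hit" is the one with the lowest E-value.
        if orf_id not in best or evalue < best[orf_id][0]:
            best[orf_id] = (evalue, family)

    families = {orf_id: family for orf_id, (_, family) in best.items()}
    # RAST-only transposases carry no IS family information here.
    for orf_id in rast_transposases:
        families.setdefault(orf_id, "unknown")
    return families


# Hypothetical example records, for illustration only.
hits = [
    ("orf1", "IS200/IS605", 1e-50),
    ("orf1", "IS5", 1e-10),
    ("orf2", "IS4", 0.5),      # fails the E-value cutoff
    ("orf3", "IS630", 1e-4),
]
rast_only = {"orf2", "orf3"}

families = assign_is_families(hits, rast_only)
family_counts = Counter(families.values())
print(families)
print(family_counts)
```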
Comparisons between all six genomes were based on nucleotide BLAST of ORFs. The ORFs from each genome were used as queries in BLAST comparisons against the other five genomes, and the criteria used to classify an ORF as shared between genomes were >95% nucleotide identity over at least 70% of the ORF length. These criteria were based on the observation that the nucleotide sequences for shared ORFs were generally >99% identical, and that the number of hits fell off rapidly below 98% identity (Figure S1 in the Supporting Information). The same similarity criteria were used to cluster ORFs within a single genome into repeated sequence groups using the CD-HIT web server (Li and Godzik 2006, Huang et al. 2010). To assess similarities across all six genomes, six tables of BLAST results (one for each genome versus the other five genomes) were merged using custom software according to sequence similarity and binned by the genomes in which the sequence was present. From the original 34,455 ORFs in the six genomes, this process produced a non-redundant set of 11,635 sequences that represented all ORFs in the six genomes.
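A minimal sketch of the sharing criteria is given below. It is not the custom software used in the study and does not attempt the merging into a non-redundant set. The record layout, function names, and example values are hypothetical, and coverage is approximated as alignment length divided by ORF length.

```python
MIN_IDENTITY_PCT = 95.0  # ">95% nucleotide identity"
MIN_COVERAGE = 0.70      # "at least 70% of the ORF length"


def is_shared(pct_identity: float, aln_len: int, orf_len: int) -> bool:
    """Apply the identity and coverage thresholds to one BLAST hit."""
    # ASSUMPTION: coverage is approximated by alignment length / ORF length.
    return pct_identity > MIN_IDENTITY_PCT and aln_len / orf_len >= MIN_COVERAGE


def presence_patterns(orf_ids, hits, home_genome):
    """Return {orf_id: frozenset of genomes in which the ORF is present}.

    hits: iterable of (orf_id, subject_genome, pct_identity, aln_len, orf_len).
    """
    present = {orf_id: {home_genome} for orf_id in orf_ids}
    for orf_id, subject_genome, pct_identity, aln_len, orf_len in hits:
        if is_shared(pct_identity, aln_len, orf_len):
            present[orf_id].add(subject_genome)
    return {orf_id: frozenset(genomes) for orf_id, genomes in present.items()}


# Hypothetical hits for three ORFs from one query genome.
example_hits = [
    ("a", "WH0005", 99.8, 900, 900),    # shared
    ("a", "WH8501", 96.0, 500, 900),    # identity passes, coverage too low
    ("b", "WH0402", 94.9, 1000, 1000),  # identity too low
    ("b", "WH0005", 99.0, 700, 1000),   # shared (exactly 70% coverage)
]
patterns = presence_patterns(["a", "b", "c"], example_hits, home_genome="WH0003")
for orf_id, genomes in sorted(patterns.items()):
    print(orf_id, sorted(genomes))
```

Binning ORFs by the resulting genome sets gives the presence–absence patterns used in the analysis that follows.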
Those 11,635 ORFs were grouped into 63 unique sharing patterns (2^6 − 1, i.e., every non-empty combination of six genomes) based on their presence–absence in the six genomes. The 63 patterns were further aggregated into categories according to the number of genomes in which the ORF was present: there are six patterns for ORFs present in only one genome, 15 possible patterns for ORFs present in two genomes, etc. Equal similarity among the six genomes would result in random sharing and the expectation of equal ORF counts for each pattern within each category. For example, there were 1,727 sequences in the 15 patterns where ORFs were present in two genomes and, in the case of random sharing, the expected ORF count would be 115.1 (i.e., 1,727/15) in each. A chi-squared goodness-of-fit test was used to assess the statistical significance of differences between the observed and expected ORF counts in each pattern. A similar analysis was performed on ORF counts for patterns grouped according to strain phenotypes. For example, the 1,727 sequences shared by exactly two genomes were aggregated into three groups: two groups in which the two genomes share a phenotype (both large-cell or both small-cell), and one group of ORFs found in one genome of each type. The random ORF sharing hypothesis was also tested for the grouped patterns using the chi-squared goodness-of-fit test.
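The pattern counts and the goodness-of-fit statistic can be illustrated with a short sketch. The strain names and the total of 1,727 two-genome ORFs come from the text. The observed group counts below are hypothetical, and scaling the grouped expectation by the number of patterns in each group is one reading of the random-sharing hypothesis, not a confirmed detail of the original analysis.

```python
from collections import Counter
from itertools import combinations

GENOMES = ["WH8501", "WH8502", "WH0401", "WH0003", "WH0005", "WH0402"]
SMALL_CELL = {"WH8501", "WH8502", "WH0401"}

# Every non-empty subset of the six genomes is one sharing pattern.
patterns_by_size = {k: list(combinations(GENOMES, k)) for k in range(1, 7)}
n_patterns = sum(len(p) for p in patterns_by_size.values())


def chi_squared_statistic(observed, expected):
    """Pearson statistic: sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


pairs = patterns_by_size[2]
total_two_genome_orfs = 1727
expected_per_pattern = total_two_genome_orfs / len(pairs)


def phenotype_group(pair):
    """Classify a two-genome pattern by the phenotypes of its members."""
    n_small = sum(genome in SMALL_CELL for genome in pair)
    return {2: "both small", 0: "both large", 1: "mixed"}[n_small]


group_sizes = Counter(phenotype_group(pair) for pair in pairs)

# ASSUMPTION: under random sharing, a group's expected count scales with
# the number of patterns it contains.
expected_grouped = {
    group: total_two_genome_orfs * n / len(pairs) for group, n in group_sizes.items()
}

# HYPOTHETICAL observed counts (they sum to 1,727); not data from the study.
observed_grouped = {"both small": 500, "both large": 600, "mixed": 627}

groups = sorted(expected_grouped)
statistic = chi_squared_statistic(
    [observed_grouped[g] for g in groups], [expected_grouped[g] for g in groups]
)
degrees_of_freedom = len(groups) - 1
print(n_patterns, len(pairs), round(expected_per_pattern, 1))
print(dict(group_sizes), round(statistic, 1), degrees_of_freedom)
```

The statistic is then compared against a chi-squared distribution with the stated degrees of freedom (number of categories minus one) to obtain a p-value.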
Analysis and PCR of specific groups of functional genes
In order to investigate the phylogenetic relationships of the six C. watsonii strains, nucleotide sequences from 25 ORFs were concatenated and aligned to construct a phylogenetic tree and distance matrix. The 25 genes were chosen using the following criteria: (i) They were present in all six strains, (ii) they had some variation between strains (i.e., the vast majority of 100% identical ORF sequences could not be used), and (iii) they had homologs in the two Cyanothece species used as the outgroup (sp. 51142 and CCY0110, which are the two most closely related genomes available, based on 16S rRNA similarity). Because the C. watsonii strains are very closely related, nucleotide sequences were compared, rather than translated amino acid sequences. This allowed the analysis to take into account all possible sequence variation, including synonymous third position changes. The sequence IDs for the original 200 sequences (25 ORFs from each of eight genomes) that were concatenated are listed in Table S1 in the Supporting Information. Eight of the 150 Crocosphaera ORFs were split into two sequences, either because they were on two contigs or by an internal stop codon, which probably arose from sequencing error. For these sequences, the two sequences are listed together in a single cell, in the order in which they were aligned. The sequences were manually concatenated into a single sequence for each genome, and the resulting eight sequences were initially aligned using ClustalX v2.0.11 (Thompson et al. 1994, 1997, Larkin et al. 2007), followed by some manual curation and phylogenetic tree construction in MEGA4 (Tamura et al. 2007). The Neighbor-Joining method, with 1,000 bootstrap replicates, was used to construct the phylogeny (Felsenstein 1985, Saitou and Nei 1987). Evolutionary distances for the tree and distance matrix were calculated based on the same alignment, using the Jukes-Cantor method in MEGA4 (Jukes and Cantor 1969, Tamura et al. 2007). 
For both the phylogeny and distance matrix, all codon positions were included (1st, 2nd, 3rd, and noncoding), and positions containing alignment gaps and missing data were eliminated only in pair-wise sequence comparisons (pair-wise deletion option). There were 22,611 positions in the final dataset.
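For reference, the Jukes-Cantor distance is d = −(3/4) ln(1 − 4p/3), where p is the proportion of compared sites that differ. The sketch below combines it with pairwise deletion as described above. It is an illustration, not MEGA4's implementation, and the function names and example sequences are hypothetical.

```python
import math

NUCLEOTIDES = set("ACGT")


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites, skipping gaps and missing data per pair."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in NUCLEOTIDES or b not in NUCLEOTIDES:
            continue  # pairwise deletion: drop the site for this pair only
        compared += 1
        differences += a != b
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


def jukes_cantor_distance(p: float) -> float:
    """Jukes-Cantor correction: d = -(3/4) * ln(1 - 4p/3)."""
    if p >= 0.75:
        raise ValueError("distance undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Hypothetical aligned fragments: one mismatch, one gapped site.
seq_1 = "ACGTACGTAC"
seq_2 = "ACGTTCG-AC"
p = p_distance(seq_1, seq_2)
d = jukes_cantor_distance(p)
print(f"p = {p:.4f}, Jukes-Cantor d = {d:.4f}")
```

Because the strains are very closely related, p is small and the corrected distance stays close to the raw proportion of differences.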
Because of prior observations of differences between strains in photosynthetic efficiency (Sohm et al. 2011), per-cell N2-fixation rates (Webb et al. 2009), and phosphorus scavenging genes (Dyhrman and Haley 2006), ORFs with functions related to these processes were compared. All ORFs in each of the six C. watsonii genomes were compared to the Kyoto Encyclopedia of Genes and Genomes (KEGG) database and given KEGG orthology (KO) assignments using the single-directional best hit method to assign orthologs via the web interface of the KEGG Automatic Annotation Server (KAAS, http://www.genome.jp/tools/kaas/; Moriya et al. 2007). Using the KO description and/or the RAST annotated function, genes with roles in the structure or function of the two photosystems, N2-fixation, and phosphorus transport/metabolism were identified. The number of variants of each gene was totaled for each C. watsonii genome using the same criteria used to create the table of 11,635 sequences described above (>95% ID over >70% of the ORF).
The observation of at least one phenotype-specific variant of the isiA gene led to a more detailed analysis of those ORFs. The C. watsonii psbC and isiA genes were used as protein BLAST query sequences against the NCBI nr protein database, and the ten most similar sequences to each were retrieved, followed by removing redundant sequences. The resulting set of protein sequences were aligned along with the C. watsonii ORFs (58 sequences total) using the online Multiple Sequence Comparison by Log-Expectation (MUSCLE, http://www.ebi.ac.uk/Tools/msa/muscle/) tool with default parameters. A phylogenetic tree was generated from the resulting alignment using the UPGMA method with 500 bootstrap replicates in MEGA5 (Sneath and Sokal 1973, Felsenstein 1985, Tamura et al. 2011). Evolutionary distances, as the number of amino acid substitutions per site, were computed using the Poisson correction method (Zuckerkandl and Pauling 1965). All ambiguous positions were removed for each sequence pair, and there were a total of 776 positions in the final dataset. In the WH0402 genome, two of the isiA genes were split into two adjacent ORFs by a stop codon, with adjacent ORFs homologous to adjacent regions of full-length isiA sequences, suggesting that the stop codons may have arisen from sequencing errors. In both cases, the adjacent ORFs were merged prior to alignment, and are so noted on the phylogenetic tree. It should also be noted that WH0401 is not shown in Clade 1 because the ORF was truncated and could not be properly aligned. As such, it is possible the full-length gene is not present in the WH0401 genome, but given the location of the ORF at the end of a contig, it is more likely missing due to the multi-contig, draft status of the genome. 
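The Poisson correction converts the proportion p of differing amino acid sites into an estimated number of substitutions per site, d = −ln(1 − p). A minimal sketch follows, with ambiguous positions removed per sequence pair as described above. The function name and example peptides are hypothetical, and this is not MEGA5's implementation.

```python
import math

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def poisson_corrected_distance(seq_a: str, seq_b: str) -> float:
    """Poisson-corrected distance d = -ln(1 - p) between two aligned proteins."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in AMINO_ACIDS or b not in AMINO_ACIDS:
            continue  # skip gaps and ambiguous residues for this pair only
        compared += 1
        differences += a != b
    if compared == 0:
        raise ValueError("no comparable sites")
    p = differences / compared
    if p >= 1.0:
        raise ValueError("distance undefined when every site differs")
    return -math.log(1.0 - p)


# Hypothetical aligned peptides: one substitution, one ambiguous residue (X).
pep_1 = "MKTAYIAKQR"
pep_2 = "MKSAYIXKQR"
poisson_d = poisson_corrected_distance(pep_1, pep_2)
print(f"Poisson-corrected d = {poisson_d:.4f}")
```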
The psbC clade was identified by examining the sequence alignment for the ~114 amino acid region between the 5th and 6th transmembrane domains of the protein that is known to be absent from IsiA proteins (Laudenbach and Straus 1988, Bricker 1990). The synteny of the genomic regions was examined using BLAST comparisons against WH0005 contig 0012 which was the longest contiguous sequence containing all isiA genes.
Prior to the sequencing of any Crocosphaera genome other than WH8501, fosmid libraries were constructed for multiple C. watsonii strains. The whole genome sequences, which were determined shortly afterward, made detailed analysis of those libraries redundant, so they are not included in this study. However, initial analyses of fosmid end sequences identified a gene variant in the library from a large-cell strain that was not present in the WH8501 genome. Three different variants of this gene, peptidoglycan-binding LysM:Peptidase M23B (referred to hereafter as lysM), were present in the WH8501 genome, and the fosmid end sequence was a fourth variant with <90% similarity to the other three variants. Based on that finding, a PCR assay was developed, which provided support for the genomic comparisons. One reverse PCR primer was designed to a conserved region of the lysM gene (complementary to all four gene variants), and individual forward primers were designed for each variant using Primer 3 (Rozen and Skaletsky 1999). The sequences for all primers and PCR product sizes are given in Table S2 in the Supporting Information. PCRs were carried out in 50 μL reactions, each containing 2 μL of template DNA from one of eight cultures (four large-cell types and four small-cell types). Final reaction concentrations of reagents were as follows: 1× PCR buffer, 2% DMSO, 0.2 mM each of dNTPs, 0.4 μM of each primer, and two units of Platinum Taq (Invitrogen, Life Technologies, Grand Island, NY, USA). Reactions underwent an initial heating step of 94°C for 90 s, then 30 cycles of 94°C for 30 s, 56°C for 60 s, and 72°C for 150 s, followed by a final extension step of 70°C for 5 min, and holding at 4°C. PCR reactions with bands of the expected size on an agarose gel were verified by subcloning and sequencing in the pGEM-T vector system (Promega, Madison, WI, USA) or by direct sequencing following reaction clean-up with QIAquick PCR Purification kits (Qiagen).
Sanger sequencing reactions and electrophoresis were completed at the UC Berkeley sequencing center using a 3730 DNA Analyzer (Applied Biosystems, Foster City, CA, USA) according to manufacturer's protocols. The PCR results are summarized in Table S3 in the Supporting Information, and for the six strains which now have genome sequence data, corresponding ORF IDs have been added for those with a positive PCR result.