Fig. S1. Flow cytometry histograms of the pre-concentrated tropical Atlantic sample that was sorted. The arrows indicate the target population. The top panels show the sample as it appeared at sea during sorting. The right-hand panel shows Synechococcus cells (R1), which were gated out from the left-hand panel. The bottom panel shows the sorted population after it was thawed and rerun on land, where it was re-sorted for purity. R1 again marks the position where Synechococcus would be expected. Note that cytometer setup is never identical between runs, even when this is aimed for, so the positions of populations differ slightly between the top and bottom panels.

Fig. S2. Analysis of an 18S rDNA clone library from the sorted, MDA-amplified population template. All of the 741 successfully sequenced 18S rDNA clones were phylogenetically placed with Bathycoccus. Prior to this maximum-likelihood analysis (PhyML), sequences were clustered at 99% identity (sequences were not manually curated) and a single representative of each of the six resulting clusters was used in the alignment and tree. Eight hundred and thirty homologous positions were analysed after gap removal, with 100 bootstrap replicates. After gapped and ambiguous positions in the alignment were discarded, differences between Bathycoccus sequences were so few that they produced the polytomy observed in this reconstruction.

Fig. S3. Aqua MODIS chlorophyll concentration, 1 February 2004–30 September 2005, the period spanning collection of the sequenced GOS samples. Grey indicates land and white indicates missing data. Yellow circles mark sites at which PHO4 was detected; black circles mark sites where no PHO4 was detected although the metagenome from the site was analysed.

Table S1. BLASTn results of cluster representatives from the population sort 16S rDNA clone library. The sequences appear to be distantly related to those in both the NCBI environmental nucleotide (env_nt) and reference genomic sequence (ref_genomes) databases, and most closely related to uncultured bacteria from the nucleotide non-redundant database (nt). Note that clone library sequence assemblies were not manually curated.

Table S2a. Number of metagenome sequences and number of identified PHO4 sequences. Those under Total Pho4 were placed on the PHO4 tree (after retrieval using the HMM), but do not necessarily have statistical support for their respective placements. Those under the Euk/Virus heading could derive from hosts or their respective viruses, while those in the viral column were unambiguously assigned to viruses. For both of these categories, only sequences retaining placement support (with probability ≥ 0.75) are included. Total non-redundant ORFs are based on six-frame translation and minimum lengths of 40 and 60 amino acids for 454-FLX and Sanger reads respectively. A breakdown of read numbers is given for all sequencing technologies in Table S4.
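
The ORF criterion in Table S2a (six-frame translation with a technology-specific minimum length) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names are hypothetical, an ORF is assumed here to be any stop-to-stop stretch of amino acids, the standard genetic code is assumed, and the removal of redundant ORFs is not shown.

```python
from itertools import product

# Standard genetic code in NCBI order (bases T, C, A, G; third base varies fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Minimum ORF lengths (amino acids) as stated in the caption of Table S2a.
MIN_ORF_LENGTH = {"454-FLX": 40, "Sanger": 60}


def reverse_complement(seq):
    """Return the reverse complement; characters other than A/C/G/T are kept as-is."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. containing N) become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def six_frame_orfs(read, technology):
    """Return stop-to-stop peptide segments from all six frames that meet the
    minimum length for the given sequencing technology (assumed ORF definition)."""
    min_len = MIN_ORF_LENGTH[technology]
    read = read.upper()
    orfs = []
    for strand in (read, reverse_complement(read)):
        for frame in range(3):
            protein = translate(strand[frame:])
            orfs.extend(seg for seg in protein.split("*") if len(seg) >= min_len)
    return orfs
```

For example, a 135-nt read with no stop codon in any frame yields six segments of 44 or 45 residues, all of which pass the 454-FLX threshold of 40 but none of which pass the Sanger threshold of 60.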

Table S2b. Summary of PHO4 detected in the Global Ocean Sampling predicted protein datasets as deposited in CAMERA. All PHO4 were identified using the HMM, and those in the Euk/Virus or Virus columns also had supported positions in the phylogenetic analysis (with probability ≥ 0.75). Sequences in Euk/Virus could derive from hosts or their respective viruses, while those in the viral column were unambiguously assigned to viruses. Total reads does not reflect the number of predicted proteins, which was not easily retrievable.

EMI_2576_sm_suppInfor.pdf (632K): Supporting info item
