pico-PLAZA, a genome database of microbial photosynthetic eukaryotes

Authors


For correspondence. E-mail klaas.vandepoele@psb.vib-ugent.be; Tel. (+32) 9 33 13822; Fax (+32) 9 33 13809.

Summary

With the advent of next-generation genome sequencing, the number of sequenced algal genomes and transcriptomes is rapidly growing. Although a few genome portals exist to browse individual genome sequences, exploring complete genome information from multiple species for the analysis of user-defined sequences or gene lists remains a major challenge. pico-PLAZA is a web-based resource (http://bioinformatics.psb.ugent.be/pico-plaza/) for algal genomics that combines different data types with intuitive tools to explore genomic diversity, perform integrative evolutionary sequence analysis and study gene functions. Apart from homologous gene families, multiple sequence alignments, phylogenetic trees, Gene Ontology, InterPro and text-mining functional annotations, different interactive viewers are available to study genome organization using gene collinearity and synteny information. Different search functions, documentation pages, export functions and an extensive glossary are available to guide non-expert scientists. To illustrate the versatility of the platform, different case studies are presented demonstrating how pico-PLAZA can be used to functionally characterize large-scale EST/RNA-Seq data sets and to perform environmental genomics. Functional enrichment analysis of 16 Phaeodactylum tricornutum transcriptome libraries offers a molecular view on diatom adaptation to different environments of ecological relevance. Furthermore, we show how complementary genomic data sources can easily be combined to identify marker genes to study the diversity and distribution of algal species, for example in metagenomes, or to quantify intraspecific diversity from environmental strains.

Algal genomics comes of age

One decade ago, sequencing of the 18S rRNA gene from environmental samples of the ocean surface unveiled an astounding diversity of planktonic microbial eukaryotes (Diez et al., 2001; Moon-van der Staay et al., 2001; Guillou et al., 2008; Massana and Pedros-Alio, 2008). Metabarcoding approaches (Tautz and Domazet-Loso, 2011) have repeatedly enabled the identification of new pico-eukaryotic (cell size < 2 μm) lineages with no cultured representatives (Goodstein et al., 2012; Not et al., 2012). These approaches also revealed the ecological importance of microbial algae in coastal waters and the ecophysiological parameters responsible for their global distribution (Demir-Hilton et al., 2011). However, the limits of using a few barcoding genes to estimate diversity became increasingly apparent with the availability of complete genomes. The comparison of the first pair of genomes in the Ostreococcus genus, O. tauri (Derelle et al., 2006) and O. lucimarinus (Palenik et al., 2007), disclosed an unexpected divergence of their genome sequences, with over 15% of species-specific genes and high levels of protein divergence, despite a 99.8% identity over the complete 18S rRNA sequence (Piganeau et al., 2011). The comparative analysis of complete genome sequences will yield alternative, less constrained genes that more accurately represent species diversity in microbial eukaryotes (Slapeta et al., 2006; Piganeau et al., 2011), and thus provide a better understanding of their ecology.

The microalgal genomic era started with the publication of the genome sequence of the red alga Cyanidioschyzon merolae (Matsuzaki et al., 2004), followed by projects focusing on the diatom Thalassiosira pseudonana (Armbrust et al., 2004) and the green alga O. tauri (Derelle et al., 2006), which complemented large-scale expressed sequence tag (EST) projects (for a review, see Tirichine and Bowler, 2011). Fuelled by comparative genomics, recent sequencing initiatives have provided significant new insights into secondary endosymbiosis (Moustafa et al., 2009; Deschamps and Moreira, 2012), genome organization and compaction (Derelle et al., 2006; Palenik et al., 2007), intron evolution (Worden et al., 2009) and horizontal gene transfer (Bowler et al., 2008; Moreau et al., 2012) in different unicellular eukaryotic species. Although a few genome portals exist to browse individual genome sequences, exploring complete genome information from multiple species remains a major challenge. Consequently, performing evolutionary analyses using genome sequences generated by different labs or consortia requires a centralized infrastructure where all information is integrated, in combination with advanced, user-friendly methods for data mining.

pico-PLAZA is a web-based resource (http://bioinformatics.psb.ugent.be/pico-plaza/) for algal genomics with intuitive tools to explore genomic diversity, perform integrative evolutionary sequence analyses and study gene functions (for a complete tool overview and data types, see Table S1). Based on 16 algal genome sequences including 1 red, 1 brown and 10 green algae, as well as 5 stramenopile species (Fig. 1), detailed information is available describing genes, homologous gene families, multiple sequence alignments, phylogenetic trees, Gene Ontology (GO), InterPro and text-mining functional annotations (see Supporting Information Text and Figs S1 and S2). Furthermore, different interactive viewers are available to study genome organization using information on gene collinearity and synteny (conservation of relative gene order between species) (Figs S3 and S4). Various search functions, documentation pages, different export options and an extensive glossary are available to guide non-expert scientists during sequence analysis. Finally, pico-PLAZA's Workbench, a user-specific analysis environment, allows the efficient characterization of large user-defined gene sets or sequences. As such, it provides a means to use the integrated genome information as a reference knowledge base for the analysis of new sequence information such as transcriptomes from non-model species lacking complete genome sequences or functional information. Below, we highlight three examples illustrating how pico-PLAZA can be used to identify species-specific genes, functionally characterize large-scale EST or RNA-Seq data sets, and identify new marker genes to study the diversity and distribution of algal species.

Figure 1.

Content of the pico-PLAZA database. An overview of the genome sequence information available in the pico-PLAZA database (http://bioinformatics.psb.ugent.be/pico-plaza/). The phylogenetic tree is based on the NCBI Taxonomy database. Footnotes: (1) phylum; (2) year of publication in parentheses; (3) AnnoMine is a novel homology-based text-mining approach to functionally annotate genes; ‘n.d.’ indicates not determined because gene descriptions were assigned by the original data provider.

Disentangling the core genome and species-specific gene families

Searching gene families through phylogenetic profiles (i.e. the presence or absence of a gene family in a species) is important to understand gene family dynamics and shed light on both ancestral gene content (Merchant et al., 2007) and new gene birth (Tautz and Domazet-Loso, 2011). The pico-PLAZA Gene Family Finder tool allows the number of genes assigned to families in all available algal genomes to be compared, and can reveal large differences for individual green algal species. Whereas O. tauri contains 6893 genes grouped into families, Volvox carteri has more than 13 800 genes. The identification of orphan genes (genes lacking paralogs or homologous genes in any other species present in the database), gene family sizes (single-copy versus multi-gene) and species content (species-specific or genes sharing homologues in other species) provides a general overview of gene distributions in the different species (Fig. 2A). The Ostreococcus species have the most streamlined genomes, characterized by a large number of single-copy gene families and a low number of multicopy families and orphans. In contrast, the largest number of genes in multicopy families is observed in V. carteri and Chlamydomonas reinhardtii. Both species contain many species-specific gene families, and 1827 gene families are shared by and unique to the Chlorophyceae (Table 1). These results explain the overall high number of genes present in these two species. Counting duplicated genes for all species reveals an important role for tandem duplication (between 3% and 15% of all genes for different species) as a molecular mechanism involved in gene family expansions (Fig. S5).
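The gene categories described above (orphan, single-copy or multicopy, species-specific or shared) can be illustrated with a short sketch. This is not the Gene Family Finder implementation; the function name, the input layout and the toy species codes are assumptions made for illustration only.

```python
from collections import Counter, defaultdict

def classify_genes(gene_table):
    """Assign each gene a category from its family's phylogenetic profile.

    gene_table: iterable of (gene_id, species, family_id) tuples, where
    family_id is None for a gene that belongs to no family (hypothetical layout).
    """
    rows = list(gene_table)

    # Phylogenetic profile: family -> species -> number of member genes.
    profile = defaultdict(Counter)
    for _gene, species, family in rows:
        if family is not None:
            profile[family][species] += 1

    categories = {}
    for gene, species, family in rows:
        # An orphan has no paralog and no homologue in any other species.
        if family is None or sum(profile[family].values()) == 1:
            categories[gene] = "orphan"
            continue
        counts = profile[family]
        copy = "single-copy" if counts[species] == 1 else "multicopy"
        scope = "species-specific" if len(counts) == 1 else "shared"
        categories[gene] = f"{copy}, {scope}"
    return categories


# Toy data with made-up gene, species and family identifiers.
toy = [
    ("g1", "Ot", "F1"), ("g2", "Ol", "F1"),
    ("g3", "Vc", "F2"), ("g4", "Vc", "F2"),
    ("g5", "Vc", None),
    ("g6", "Cr", "F1"), ("g7", "Cr", "F1"),
]
for gene, category in sorted(classify_genes(toy).items()):
    print(gene, category)
```

Counting the categories per species would give fractions comparable in spirit to those plotted in Fig. 2A, although the exact category definitions used for the figure may differ.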

Figure 2.

Overview of gene content in different algal genomes.

A. Fraction of protein-coding genes assigned to different categories based on homologues in other species and copy number.

B. Pan and core genome plots. Starting from the first species (Volvox carteri, left), the number of genes with homologues in other species (right) is scored as ‘core’, whereas the number of new genes without homologues in the species already included (to the left of the current species) is scored as ‘new pan genes’. ‘Pan genes’ indicates the sum of all ‘new pan genes’ based on the species already included. The number of core genes at specific taxonomic levels is indicated. The left Y-axis covers the number of core and pan genes; the right Y-axis reports the number of new pan genes.

Table 1. Clade-specific gene families

Clade-specific core families (a)    Number of gene families
Land plants (3)                     974
Green algae (10)                    37
Chlorophyceae (2)                   1827
Trebouxiophyceae (2)                139
Mamiellophyceae (6)                 449
Diatoms (3)                         1035

(a) Numbers in parentheses indicate the number of included species.

In contrast to species-specific features, determining the number of core genes (i.e. genes shared in all species from a clade or specific set of species) within green algae revealed that 2078 core families are shared between all 10 species (Fig. 2B). When including brown/red algae or higher plants, this number decreases to 1494 and 1089 respectively. Considering both the large number of new pan families (i.e. new families not observed in a specific set of organisms) as well as clade-specific families (Table 1), it is possible to weight the importance of the acquisition of new gene functions and the expansion of specific gene families in the relationship between genotypic diversity and algal phenotypes (Blanc et al., 2012; Moreau et al., 2012). Examples of expanded functional categories include proteins with ankyrin repeat-containing domains in Ectocarpus siliculosus and Bathycoccus prasinos, protein kinases in C. reinhardtii, and tetratricopeptide-like helical proteins in E. siliculosus and Aureococcus anophagefferens.
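The core and pan counts can be sketched as a cumulative set computation over species added one at a time. The version below is a simplification that counts gene families rather than genes, and its function name, input layout and toy data are assumptions, not the procedure used to draw Fig. 2B.

```python
def pan_core_curve(family_sets, order):
    """Accumulate core, pan and new pan family counts species by species.

    family_sets: dict mapping species -> set of family identifiers (hypothetical).
    order: the sequence in which species are added.
    Returns a list of (species, n_core, n_pan, n_new_pan) tuples.
    """
    core = None   # families present in every species added so far
    pan = set()   # families present in at least one species added so far
    curve = []
    for species in order:
        families = family_sets[species]
        new_pan = families - pan          # families not seen in earlier species
        pan |= families
        core = set(families) if core is None else core & families
        curve.append((species, len(core), len(pan), len(new_pan)))
    return curve


# Toy family sets with made-up identifiers.
toy_sets = {"A": {1, 2, 3, 4}, "B": {2, 3, 4, 5}, "C": {3, 4, 6}}
for row in pan_core_curve(toy_sets, ["A", "B", "C"]):
    print(row)
```

The core count can only shrink and the pan count can only grow as species are added, which matches the decreasing core numbers reported when more distant lineages are included.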

Efficient exploration of gene function diversity in large-scale expression data sets

Apart from browsing individual genes or functional categories, pico-PLAZA can also be applied as a data warehouse to analyse large gene sets or characterize new sequences. To demonstrate this feature, we performed a functional and comparative analysis of a set of > 10 000 EST sequences from Phaeodactylum tricornutum using the Workbench. Based on a large-scale expression data set of > 120 000 sequenced cDNAs from 16 different libraries (Maheswari et al., 2010), we created two Workbench experiments for each library. One experiment comprises all sequences expressed in that condition, regardless of their expression in other conditions (called condition_all), while the second experiment covers sequences uniquely expressed in that condition (called condition_specific). The 16 libraries explore the responses of P. tricornutum to a range of growth conditions, including different nutrient regimes of Si, N, Fe, and dissolved inorganic carbon, stress (hyposalinity and low temperature), and blue light.
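The two experiments per library amount to simple set operations, sketched below. The sketch assumes that expressed sequences have already been reduced to comparable identifiers (for example gene models); the function name and the toy library labels are illustrative and not part of the Workbench.

```python
def build_experiments(library_sequences):
    """Derive the '<library>_all' and '<library>_specific' sets for each library.

    library_sequences: dict mapping library name -> set of expressed
    sequence identifiers (hypothetical layout).
    """
    experiments = {}
    for library, sequences in library_sequences.items():
        # Everything expressed in any other library.
        elsewhere = set().union(
            *(seqs for name, seqs in library_sequences.items() if name != library)
        )
        experiments[f"{library}_all"] = set(sequences)
        experiments[f"{library}_specific"] = set(sequences) - elsewhere
    return experiments


# Toy input: three libraries with made-up identifiers.
toy_libraries = {
    "ua": {"s1", "s2", "s3"},
    "fl": {"s2", "s4"},
    "bl": {"s3", "s4", "s5"},
}
for name, members in sorted(build_experiments(toy_libraries).items()):
    print(name, sorted(members))
```

With 16 libraries as input, this yields the 32 sets described in the text.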

Results on associated genes, families and functional GO enrichment analysis for all 32 experiments are summarized in Tables S2 and S3. We further present a detailed analysis of sequences from the ‘urea adapted (ua)’ library. After mapping all 3436 ‘ua’ sequences to the genome annotation of P. tricornutum (BLASTN against annotated transcripts; E-value < 1e-05 via the Workbench), a total of 2863 gene models were tagged with one or more EST sequences. Ninety-four per cent of these genes are associated with 1954 pico-PLAZA multigene families, and a detailed analysis of the phylogenetic family profiles reveals that 69 (4%) and 441 (23%) families are specific to P. tricornutum and diatoms respectively. Interestingly, the latter includes a family of S-adenosylmethionine decarboxylases (HOM004619) involved in spermidine biosynthesis that was putatively acquired through horizontal gene transfer from a bacterial donor (Maheswari et al., 2010). GO enrichment analysis (Fig. S6) of the ‘ua_all’ gene set reveals an overrepresentation of genes involved in nitrogen (405 genes), amino acid (117 genes) and organic acid metabolism (132 genes), confirming that diatoms can use urea as a nitrogen source (Armbrust et al., 2004; Maheswari et al., 2010). Interestingly, five ‘ua_specific’ gene families have homologues only in diatoms and therefore comprise diatom-specific genes playing a role in urea-mediated signalling. These functional enrichment results offer a molecular view on the adaptation of P. tricornutum to different environments of ecological relevance. Furthermore, the possibility of combining diatom-specific gene families with specific transcriptional responses under different nutrient, stress and light regimes provides an entry point to link currently unknown genes with unique phenotypic features (Gollery et al., 2006).
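To make the notion of GO overrepresentation concrete, the sketch below computes a one-sided hypergeometric p-value, a common choice for enrichment testing. This section does not state which statistic or multiple-testing correction the Workbench applies, so the test, the function name and the toy numbers are all assumptions.

```python
from math import comb

def hypergeom_enrichment_p(k, n, K, N):
    """Probability of observing at least k annotated genes in a gene set.

    k: genes in the set that carry the GO term
    n: size of the gene set
    K: genes in the background that carry the GO term
    N: size of the background
    """
    total = comb(N, n)
    upper = min(n, K)
    # Sum the upper tail of the hypergeometric distribution with exact integers.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Toy numbers (not taken from the study): 8 of 20 genes in the set carry
# a term that annotates 50 of the 1000 background genes.
p_value = hypergeom_enrichment_p(k=8, n=20, K=50, N=1000)
print(f"p = {p_value:.3g}")
```

In practice one p-value is computed per GO term, so a correction for multiple testing would normally follow.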

Sieving complete genomes for new barcoding genes

Based on the different integrated genomes, precomputed gene families and detailed gene orthology information (Fig. S2), pico-PLAZA enables a systematic screen of the gene content from complete genomes of microbial photosynthetic eukaryotes. This permits the identification of alternative barcoding genes to screen metagenomes and to address issues about the diversity and distribution of microbial algae. These candidate marker genes should preferably be single-copy genes with a scalable phylogenetic spread from the genus to the order and phylum level. Although this case study is currently restricted to the lineages represented in pico-PLAZA, the number of available genomes will rapidly increase because of future genome projects of microbial eukaryotes and large-scale sequencing initiatives such as the Tara Oceans protist sequencing project (Karsenti et al., 2011) and CAMERA (Sun et al., 2011).

To identify lineage-specific genes for environmental monitoring, the Gene Family Finder tool can be used to find species or clade-specific gene families and identify putative gene markers. For example, 442 protein-coding genes are single copy in all three Ostreococcus species (option ‘Clade selection: Ostreococcus’) and absent in Micromonas and Bathycoccus (Table S4). The single-copy feature of a candidate barcoding gene is essential to avoid spurious diversity overestimation from multiple gene copies within a genome. Performing a query on single-copy genes in the order Mamiellales leads to the retrieval of 328 gene families. For each of these gene families, visual inspection of the amino-acid alignment using the JalView editor (University of Dundee, Dundee, Scotland, UK) (Waterhouse et al., 2009) enables the identification of conserved motifs for Polymerase Chain Reaction (PCR) primer design. This two-step protocol provides a practical approach for the detection of genes that can be used to investigate the prevalence of Ostreococcus (or Mamiellophyceae) and their distribution in the ocean. Protein-coding gene markers may enable intraspecific and interspecific diversity to be investigated alongside one another by using appropriate constraints on synonymous coding positions.
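The first step of this protocol, selecting families that are single copy in every species of a target clade and absent from chosen outgroups, can be sketched as a filter over per-family gene counts. The function name, the input layout and the species codes are hypothetical; in pico-PLAZA the equivalent query is run through the Gene Family Finder.

```python
def candidate_markers(family_counts, target, excluded=()):
    """Return families with exactly one gene in each target species
    and no gene in any excluded species.

    family_counts: dict mapping family -> {species: gene count} (hypothetical).
    target: species in which the family must be single copy.
    excluded: species in which the family must be absent.
    """
    hits = []
    for family, counts in family_counts.items():
        single_in_targets = all(counts.get(sp, 0) == 1 for sp in target)
        absent_elsewhere = all(counts.get(sp, 0) == 0 for sp in excluded)
        if single_in_targets and absent_elsewhere:
            hits.append(family)
    return sorted(hits)


# Toy counts with made-up family identifiers and species codes.
toy_counts = {
    "FAM_A": {"Ot": 1, "Ol": 1, "Osp": 1},             # single copy, clade-specific
    "FAM_B": {"Ot": 1, "Ol": 2, "Osp": 1},             # duplicated in one species
    "FAM_C": {"Ot": 1, "Ol": 1, "Osp": 1, "Mic": 1},   # also present in an outgroup
    "FAM_D": {"Ot": 1, "Ol": 1},                       # missing from one target
}
print(candidate_markers(toy_counts, target=["Ot", "Ol", "Osp"], excluded=["Mic", "Bat"]))
```

Dropping the excluded list widens the query to families that are single copy in the targets regardless of their presence elsewhere.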

As a second example, we demonstrate how pico-PLAZA can be used to identify intraspecific markers based on multispecies collinearity. The level of nucleotide polymorphism at neutrally evolving sites is a fundamental parameter in molecular evolution, as it is informative about the mutation rate and the effective population size of a species. The proportion of neutrally evolving sites is expected to be lower in protein-coding genes than in intergenic regions. In Ostreococcus, intergenic regions flanked by two stop codons (called ‘tail-to-tail’ intergenic regions) have the highest proportion of neutrally evolving sites (Piganeau et al., 2009). Using the GenomeView genome browser (Broad Institute of MIT and Harvard, Cambridge, MA, USA) (Abeel et al., 2012), pico-PLAZA enables tail-to-tail intergenic regions to be identified rapidly in each genome. Furthermore, cross-species collinearity information (Fig. S4) provides detailed information about conserved intergenic regions that are flanked by orthologous genes. These regions are good candidates for the estimation of intraspecific diversity from environmental strains. The application of this approach to the genomes of Ostreococcus guided the choice of eight tail-to-tail intergenic regions for use as markers to estimate the level of nucleotide polymorphism in O. tauri in the Northwest Mediterranean (Grimsley et al., 2010). The spectrum of polymorphism observed in these sequences provided indirect evidence for meiotic recombination, a key process of adaptation in natural populations.
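Tail-to-tail intergenic regions can also be located programmatically from gene coordinates: they lie between a forward-strand gene and the next gene downstream when that gene is on the reverse strand, so that both flanks are 3' ends. The sketch below assumes genes from a single chromosome with 1-based inclusive coordinates; the record type and function name are illustrative and unrelated to GenomeView.

```python
from typing import NamedTuple

class Gene(NamedTuple):
    name: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"

def tail_to_tail_regions(genes):
    """List intergenic regions flanked by the 3' ends of two adjacent genes.

    Returns (left_gene, right_gene, region_start, region_end) tuples.
    """
    ordered = sorted(genes, key=lambda g: g.start)
    regions = []
    for left, right in zip(ordered, ordered[1:]):
        convergent = left.strand == "+" and right.strand == "-"
        has_gap = right.start > left.end + 1   # skip overlapping or abutting genes
        if convergent and has_gap:
            regions.append((left.name, right.name, left.end + 1, right.start - 1))
    return regions


# Toy annotation with made-up coordinates.
toy_genes = [
    Gene("A", 1, 100, "+"),
    Gene("B", 150, 300, "-"),
    Gene("C", 400, 500, "-"),
    Gene("D", 600, 700, "+"),
    Gene("E", 750, 900, "-"),
]
for region in tail_to_tail_regions(toy_genes):
    print(region)
```

Restricting the output to pairs whose flanking genes are orthologous across species would approximate the collinearity-based selection described in the text.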

Conclusions

pico-PLAZA provides an unparalleled set of data types and tools for comparative genomics and data mining in algae. Future efforts will be made to extend the number of available algal species and to include novel data types to study gene function and regulation. Overall, pico-PLAZA represents a useful toolkit to aid researchers in the exploration of the diversity and evolution of algal genomes through a comprehensible web-based research interface.

Acknowledgements

We would like to thank Pierre Rouzé, Stephane Rombauts and Evelyne Derelle for general feedback and Sebastian Proost for technical i-ADHoRe support. SVL would like to thank the Research Foundation Flanders (FWO) for funding her research. This work was supported by the Multidisciplinary Research Partnership ‘Bioinformatics: from nucleotides to networks’ Project (no. 01MR0410W) of Ghent University and Agence Nationale de la Recherche grant PHYTADAPT n° NT09_567009.
