• The tropical intertidal ecosystem is defined by trees – mangroves – which are adapted to an extreme and extremely variable environment. The genetic basis underlying these adaptations is, however, virtually unknown. Based on advances in pyrosequencing, we present here the first transcriptome analysis for plants for which no prior genomic information was available. We selected the mangroves Rhizophora mangle (Rhizophoraceae) and Heritiera littoralis (Malvaceae) as ecologically important extremophiles employing markedly different physiological and life-history strategies for survival and dominance in this extreme environment.
• For maximal representation of conditional transcripts, mRNA was obtained from a variety of developmental stages, tissue types, and habitats. For each species, a normalized cDNA library of pooled mRNAs was analysed using GSFLX pyrosequencing.
• A total of 537 635 sequences were assembled de novo and annotated as > 13 000 distinct gene models for each species. Gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) orthology annotations highlighted remarkable similarities in the mangrove transcriptome profiles, which differed substantially from the model plants Arabidopsis and Populus.
• Similarities in the two species suggest a unique mangrove lifestyle overarching the effects of transcriptome size, habitat, tissue type, developmental stage, and biogeographic and phylogenetic differences between them.
The mangrove ecosystem is defined by a group of halophytes, largely trees, that dominate tropical intertidal zones and estuaries. The term itself refers both to the ecosystem and to the individual trees and tree species. The mangrove environment is an extreme one, characterized by prolonged and sometimes deep flooding, but also by prolonged periods (especially during neap tides) of drying soil, root zone anoxia, high temperatures, hurricane force wind, and high and extremely variable salt conditions in typically resource-poor environments. As organisms evolutionarily adapted to thrive in these extreme conditions, mangroves are true extremophiles (c.f. Inan et al., 2004). The genetic basis for these characters is, however, virtually unknown: mangroves and other extremophiles, indeed most nonmodel plants, are very poorly represented in the plant molecular literature. Thus, they remain an untapped resource for understanding and exploiting plant adaptations to extreme environments.
Despite their common grouping as ‘mangroves’, mangrove taxa are biogeographically and taxonomically diverse. There are several interpretations for the origin of mangroves, but a consensus based on fossil evidence is that mangroves originated during the late Cretaceous near the Sea of Tethys which separated the supercontinents, Laurasia and Gondwanaland (Plaziat et al., 2001; Saenger, 2002). Mangroves today are represented in at least 20 families (Duke et al., 1998), and include ferns, monocots and dicots. With a multitude of structural adaptations reflecting responses to common environmental constraints, the mangrove community exemplifies one of the stronger cases for convergent evolution in the plant kingdom (Tomlinson, 1986; Ellison et al., 1999). Convergent evolution, however, has not led to physiological uniformity. With respect to salt handling, for example, the physiological strategies range from salt excreters (e.g. Avicennia, Aegiceras, Sonneratia), to salt regulators (e.g. Rhizophora, Bruguiera, Xylocarpus) to hyper-excluders (e.g. Heritiera) (Scholander et al., 1962; Flowers et al., 1977; Paliyavuth et al., 2004). In the two focal species of this study, this is reflected in the sodium (Na+) contents of leaves. Rhizophora mangle lacks salt glands or other excretory mechanisms, and controls salt entry through the roots, but nevertheless accumulates Na to > 500 mM (tissue water basis), and maintains a Na : potassium (K) ratio of c. 4 : 1 in full seawater (J. M. Cheeseman, unpublished; Popp, 1984 presents similar data for Australian rhizophoracean mangroves). Heritiera littoralis also lacks salt glands but, by contrast, even in hypersaline conditions its leaves contain < 50 mM Na with Na : K ratios of 0.5 : 1 or less (Popp, 1984; J. M. Cheeseman, unpublished; Paliyavuth et al., 2004).
Whether the goal is to elucidate the genetic basis for the physiological differences, or to exploit the group's unique genetic resources, much greater genome-level understanding is needed. Particularly in an era of rapid global change, the need for such studies has been increasingly recognized (Reusch & Wood, 2007; Karrenberg & Widmer, 2008). Sequencing of complex genomes remains challenging, however (Pop & Salzberg, 2008), and can be problematic during assembly, even with current advances in sequencing technologies. This applies especially to nonmodel species such as mangroves for which genomic information is scarce. As an alternative, sequencing transcriptomes entails less complexity during assembly while specifically identifying expressed genes. 454/Roche GSFLX pyrosequencing provides a versatile platform with long reads, exceptional accuracy, and ultra-high throughput sequencing compared with older sequencing strategies (Droege & Hill, 2008). Transcriptome analysis with pyrosequencing for model organisms (Weber et al., 2007; Torres et al., 2008) and plant species with extensive expressed sequence tag (EST) data has demonstrated the suitability of this method in providing deep representation of transcripts (Cheung et al., 2006; Barbazuk et al., 2007; Swaminathan et al., 2007).
Here, we have begun an in-depth analysis of the transcriptomes of R. mangle and H. littoralis. We chose these species based on their differing physiological strategies, their distinct biogeographic distributions (neotropical and Indo-West Pacific, respectively), and their distinct and evolutionarily distant phylogenetic positions (R. mangle is more closely related to Populus and H. littoralis is more closely related to Arabidopsis). We began with RNA collected from a wide variety of tissues and environmental conditions in nature and the glasshouse. We report transcript profiles obtained by 454/Roche GSFLX pyrosequencing and subsequent assembly and global annotation. Similarities are evident between the two mangroves, and differences in the representation of transcript gene ontology (GO) and KEGG (Kyoto Encyclopedia of Genes and Genomes) orthology categories distinguish them from model plants and point to a unique mangrove ‘lifestyle’ (sensu Melzer et al., 2008).
Materials and Methods
Plant material was harvested from both the glasshouse and the field. Tissue samples were immediately stored in liquid nitrogen or RNAlater (Applied Biosystems/Ambion, Austin, TX, USA) until further processing. RNAlater was used according to the manufacturer's instructions with tissue to volume ratios of < 100 mg ml⁻¹.
Rhizophora mangle L. field samples were collected at Twin Cays, Belize, a peat-based mangrove archipelago 12 km from the coast of Belize, just inside the Mesoamerican barrier reef (Feller et al., 2003). Tissues represented included leaves, roots, hypocotyl peels, young and mature propagules (viviparous seedlings still attached to the mother plant) and flower buds of stunted and tall individuals and P-fertilized stunted plants. The propagules for glasshouse plants were obtained from the same field site. The glasshouse samples included young leaf buds and shoot meristems, mature buds, stipules, young leaves, mature leaves, senescing leaves (early stages), young stem, fine roots, old, thickened roots, mature stem bark, and prop root tips from plants growing at salinities ranging from c. 2–100% of full seawater. Collectively, 68 different tissue types, growth conditions, and development stages were extracted for R. mangle. Heritiera littoralis Dryand. tissue samples were taken from young and mature leaves, roots, buds, and young stems of 3-yr-old saplings grown in the glasshouse at c. 25% of full seawater salinity. The seeds used originated from an estuarine population on the southwest coast of Sri Lanka.
Total RNA was isolated using the Plant RNA Isolation Mini Kit (Agilent, Santa Clara, CA, USA). RNA samples were treated with recombinant DNase I (TURBO DNase; Ambion) at 1.5 units µg⁻¹ of total RNA, and further processed with Norgen RNA clean-up and concentration kits (Thorold, Ontario, Canada). Total RNA purity and degradation were checked with 0.8% agarose gels and an Agilent 2100 Bioanalyzer before proceeding. Equal amounts of RNA from the different tissue types were then pooled for each species for subsequent procedures.
cDNA synthesis and normalization
Approximately 200 µg of total RNA were used to extract mRNA using Oligotex mRNA mini Kits (Qiagen, Valencia, CA, USA). Subsequently, 0.5 µg of mRNA for each species was converted to cDNA using the SMART cDNA synthesis protocol (Clontech, Mountain View, CA, USA). Long poly(A:T) tails in cDNA synthesis have, until recently, led to sequence reads of low quantity and quality with the Genome Sequencer FLX system. This limitation was successfully overcome by a combination of modified amplification reactions and primers designed at the WM Keck Center, University of Illinois, Urbana, IL, USA. The modified poly(T) primer includes other nucleotides interspersed in the poly(T): (TAGAGACCGAGGCGGCCGACATGTTTTGTTTTTTTTTCTTTTTTTTTTVN). For cDNA synthesis, this primer was used in combination with the 5′ rapid amplification of cDNA ends (RACE) SMARTIV primer (Clontech).
To improve coverage and sequencing of rare transcripts, cDNAs were normalized with a Trimmer Direct Kit (Evrogen, Moscow, Russia). cDNAs were denatured and allowed to self-anneal in a hybridization reaction for a period of 4–6 h. Within this period, most of the abundant transcripts are assumed to pair with their homologs while the unique/rare transcripts and their homologs remain single stranded. After hybridization, duplex/double stranded specific nuclease was added to the reaction to degrade ds-cDNAs. Polymerase chain reaction (PCR) was then used to reamplify the single-stranded transcripts and their homologs, providing the pool of normalized dsDNAs.
Library preparation and sequencing
The cDNAs were nebulized and selected for an average size of 400–500 bp. The FLX specific adapters, AdapterA (GCCTCCCTCGCGCCATCAG) and AdapterB (GCCTTGCCAGCCCGCTCAG), were ligated to the cDNA ends after end-polishing reactions, resulting in AdapterA–DNA fragment–AdapterB constructs. The adapter ligated DNAs were then immobilized on library preparation beads to capture the ssDNAs used for clonal amplification in emulsion PCR (emPCR). AdapterA sequences were used as the sequencing primer; AdapterB sequences were used to bind to the homolog sequences present at the surface of the emPCR beads. The emPCR was carried out using emPCR Kit II according to the Roche amplicon procedure. Biotinylated Adapter A, added during ssDNA construction, was used to facilitate capture and recovery of all DNA positive beads using streptavidin-coated magnetic beads. Amplified beads were loaded into a 70 × 75 mm PicoTiterPlate (PTP). Loading was followed by addition of packing beads and enzyme beads, and sequencing was carried out with an LR70 sequencing kit (Roche). The PTP was then placed onto the 454/Roche GSFLX (Roche Applied Science, Indianapolis, IN, USA) and bases (TACG) were sequentially flowed across the plate (100 cycles). A preliminary titration run was performed to determine the optimum reaction conditions; this was followed by a bulk sequencing run.
Adapter sequences were trimmed using in-house Perl scripts and any remaining sequences below 20 bp in length were discarded. De novo sequence assembly was done combining titration and bulk run sequences for each species. Contigs were assembled with at least 40 bp overlap and 90% identity. Singlets with > 75%, and contigs with > 50% homopolymer regions were discarded. Only sequences flagged as ‘high quality’ (> 99.5% accuracy on single base reads) by the GSFLX pyrosequencing software were assembled into contigs. Two programs were tested for contig assembly: the Newbler assembler provided with the GSFLX sequencer, with a quality score threshold set at 40, and the Phrap assembly program (http://www.phrap.org) with quality scores greater than 20. A Phrap score of 20 (Phrap 20) corresponds to 99% accuracy for a given base in an assembled sequence.
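The relationship between a Phred-scaled quality score and per-base accuracy, and the length and homopolymer filters described above, can be sketched as follows. This is an illustrative sketch only, not the in-house Perl scripts used in the study: the function names are hypothetical, and the definition of ‘homopolymer regions’ (bases lying in single-base runs of at least `min_run`, here 5) is an assumption, as the text does not specify it.

```python
from itertools import groupby


def phred_to_accuracy(q: float) -> float:
    """Per-base accuracy implied by a Phred-scaled quality score q."""
    return 1.0 - 10.0 ** (-q / 10.0)


def homopolymer_fraction(seq: str, min_run: int = 5) -> float:
    """Fraction of bases lying in single-base runs of length >= min_run.

    min_run is an assumed parameter; the study does not state how
    homopolymer regions were delimited.
    """
    if not seq:
        return 0.0
    run_lengths = (len(list(group)) for _, group in groupby(seq.upper()))
    in_runs = sum(n for n in run_lengths if n >= min_run)
    return in_runs / len(seq)


def keep_sequence(seq: str, is_contig: bool, min_len: int = 20) -> bool:
    """Apply the length filter and the homopolymer limits from the text:
    contigs are discarded above 50%, singlets above 75%."""
    if len(seq) < min_len:
        return False
    limit = 0.50 if is_contig else 0.75
    return homopolymer_fraction(seq) <= limit
```

Under this definition a score of 20 gives 99% accuracy, in agreement with the Phrap 20 value quoted above.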
Sequence annotation was based on a set of sequential blast searches (Altschul et al., 1997) designed to find the most descriptive annotation possible for each sequence. The first search (blastx) was performed against the National Center for Biotechnology Information (NCBI; http://www.ncbi.nlm.nih.gov/) nonredundant (nr) protein database limited to Arabidopsis thaliana, as the A. thaliana genome annotation is the most advanced and complete for any higher plant to date. Sequences that did not show a match were then searched against the nr protein database limited to all plants. Next, the sequences were searched using blastn, first against A. thaliana and then against all plants. For the contigs and singlets above 200 bp, the blastx and blastn searches were limited to results with e-values lower than 10⁻³ and 10⁻⁴, respectively. The blast searches for singlets below 200 bp were carried out with an e-value cutoff of 10⁻⁵. In practice, the e-values for more than 90% of the annotated sequences were < 10⁻¹⁰. A final blastn search against all sequences in nr was performed for sequences that did not have a match in any of the previous searches. That set was also searched with blastn against the NCBI EST and Environmental Samples databases. In-house Perl scripts (available on request) were used to parse blast outputs. The GO annotations were assigned based on the similarity to A. thaliana sequences; KEGG pathway annotations were assigned based on appropriately annotated plant reference genomes in NCBI.
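The control flow of this sequential search can be sketched as a cascade in which each sequence is passed to the next database only if the previous search returned no acceptable hit. The sketch below is a hedged illustration, not the authors' Perl pipeline: `Search` stands for a hypothetical wrapper around a single blast run, the stub functions return invented hits purely to exercise the logic, and treating sequences of exactly 200 bp as ‘long’ is an assumption.

```python
from typing import Callable, Optional, Sequence, Tuple

Hit = Tuple[str, float]                  # (description, e-value)
Search = Callable[[str], Optional[Hit]]  # hypothetical wrapper around one blast run
Stage = Tuple[str, str, Search]          # (stage name, program, search function)


def evalue_cutoff(program: str, length: int) -> float:
    """E-value cutoffs as stated in the text; the 200-bp boundary is assumed."""
    if length < 200:
        return 1e-5
    return 1e-3 if program == "blastx" else 1e-4


def annotate(seq: str, stages: Sequence[Stage]) -> Optional[Tuple[str, Hit]]:
    """Return (stage name, hit) from the first stage with a passing hit."""
    for name, program, search in stages:
        hit = search(seq)
        if hit is not None and hit[1] < evalue_cutoff(program, len(seq)):
            return name, hit
    return None  # no annotation found in any stage


# Stub searches with invented results, for illustration only.
def no_hit(seq: str) -> Optional[Hit]:
    return None


def plant_protein_hit(seq: str) -> Optional[Hit]:
    return ("example plant protein", 1e-20)


stages = [
    ("A. thaliana protein", "blastx", no_hit),
    ("all-plant protein", "blastx", plant_protein_hit),
    ("A. thaliana nucleotide", "blastn", no_hit),
    ("all-plant nucleotide", "blastn", no_hit),
]
result = annotate("ACGT" * 60, stages)
```

Because the cascade stops at the first passing hit, the order of the stages determines which annotation a sequence receives.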
Library preparation for Sanger sequencing and EST mapping
The cDNAs (100 ng) prepared for GSFLX sequencing were cloned nondirectionally into pCRII-Blunt-TOPO vector (Invitrogen, Carlsbad, CA, USA). Plasmid DNA was prepared following heat lysis according to the manufacturer's protocol. Sequencing reactions were carried out in a 1/32 BigDye reaction (Applied Biosystems). Sequencing was performed from the 5′-end of the cDNA in an ABI3730 capillary sequencer using M2 primer (5′-AAGCAGTGGTATCAACGCAG-3′) (Evrogen). Resulting EST sequences were trimmed from vector sequences and compared with GSFLX ESTs using clustalx (Thompson et al., 1997).
The combined titration and bulk GSFLX runs representing the two normalized cDNA libraries resulted in 232 264 and 305 371 sequence reads for R. mangle and H. littoralis, respectively (Table 1). After removing low-quality sequences and trimming adapter sequences, the average sequence read length was 208 bp. This was sufficient to circumvent problems of homopolymer generation and to enable annotations with fewer errors (Pop & Salzberg, 2008). All sequences have been deposited at the NCBI, and can be accessed in the Short Read Archive (SRA) under the project accession number SRA002286.3.
Table 1. Number of sequences in contigs and singlets for Heritiera littoralis and Rhizophora mangle
‘Sequences’ are the raw numbers resulting from GSFLX output. ‘Contigs’ are longer continuous gene models resulting from Phrap assembly; ‘reads per contig’ indicates the number of individual but overlapping sequences included in the contig. ‘Singlets’ are annotated sequences not included in contigs.
Total number of sequences
Number of sequences removed owing to low quality
Total number of sequences remaining
Number of sequences in contigs
Number of contigs
Number of contigs removed owing to homopolymers
Number of contigs remaining
Average contig size
Average number of reads per contig
Total number of singlets
Number of singlets removed owing to homopolymers
Number of singlets remaining
Singlets above 200 bp
Sanger-type sequencing remains the standard for EST sequencing. To evaluate the accuracy of our GSFLX-derived sequences, the normalized cDNA used to prepare the GSFLX library was therefore used to construct an additional cDNA library. For each species, a random set of 10 cDNAs from this library was selected, cloned and analysed by the dye terminator sequencing method (ABI3730 sequencer), and mapped to the corresponding GSFLX sequences. The alignments between Sanger ESTs and GSFLX ESTs showed 97 ± 0.02% identities with 0.3 ± 0.67% gaps. This indicated that pyrosequencing now provides a level of accuracy comparable to Sanger sequencing and enabled subsequent sequence annotations.
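Per cent identity and per cent gaps can be computed from a pairwise alignment along the following lines. This is a minimal sketch under stated assumptions: both measures are expressed relative to the number of alignment columns, which may not match the convention used by the alignment software in the study, and the function name is hypothetical.

```python
from typing import Tuple


def identity_and_gaps(aligned_a: str, aligned_b: str) -> Tuple[float, float]:
    """Return (% identical columns, % gapped columns) for two aligned
    sequences of equal length in which '-' marks a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    columns = len(aligned_a)
    pairs = list(zip(aligned_a.upper(), aligned_b.upper()))
    matches = sum(1 for x, y in pairs if x == y and x != "-")
    gaps = sum(1 for x, y in pairs if x == "-" or y == "-")
    return 100.0 * matches / columns, 100.0 * gaps / columns
```

For example, `identity_and_gaps("ACGTACGTAC", "ACGTAC-TAC")` gives 90% identity and 10% gaps, since nine of the ten columns match and one contains a gap.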
De novo contig assembly
Common approaches to analyse a GSFLX-generated transcriptome of a virtually unknown genome include mapping the ESTs to a closely related model genome that has been sequenced, or using existing, extensive EST databases. In the absence of either of these for mangroves, de novo assembly was carried out with the Newbler and Phrap assembly programs. Both programs generated comparable contigs; however, where Newbler generated uncalled bases (Ns) in a given contig, Phrap replaced the uncalled positions with a consensus base. In addition, studies evaluating the accuracy and reliability of Phrap have demonstrated that the program is superior to others in generating homogeneous contigs in nonrepetitive regions (Rieder et al., 1998; Mavromatis et al., 2007; Phillippy et al., 2008). In practice, this means that the potential errors of mis-assemblies and chimeric contigs are minimized when Phrap is used on ESTs. Therefore, Phrap assembly was selected for our analysis (Table 1).
Following assembly, the average contig size was > 350 bp (Fig. 1), which is sufficient to assign functional annotations effectively. A total of 67 375 R. mangle and 96 989 H. littoralis sequences (contigs plus singlets) were used for annotations and further analysis. In R. mangle, the longest 10% of the contigs were greater than 830 bp, while only 0.6% were < 100 bp. In H. littoralis, the longest 10% were greater than 675 bp and 1.4% were < 100 bp.
Despite normalization, a few ESTs were sequenced and annotated hundreds of times. In H. littoralis, for example, the metallothionein 2a annotation was returned 487 times, and in R. mangle, a pfkB-type carbohydrate kinase family protein was annotated 861 times. Among the 20 most frequently returned annotations in each species, several were entirely absent from the data set for the other species (Table 2). Expressed sequence tags matching a mitochondrial transcription termination factor family protein (GI 18415647) and ubiquitin-protein ligase (GI 18395424), by contrast, were sequenced > 100 times in both mangroves. A particularly interesting case was Bg70, which was annotated 756 times in R. mangle, but absent from H. littoralis. This is a gene family of unknown function previously reported only from the mangrove Bruguiera gymnorrhiza (Rhizophoraceae) (Banzai et al., 2002). In microarray and real-time PCR studies of this species, the family has been reported as highly expressed in salt-treated plants (Miyama et al., 2006; Miyama & Hanagata, 2007; Liang et al., 2008).
Table 2. The most frequently sequenced transcripts of each mangrove species
Best match GI number
Number of ESTs [rank]
Number of sequences
Number of ESTs [rank]
Number of sequences
Based on the sequential BLAST annotation, each protein/gene was identified with a unique GI number assigned by the National Center for Biotechnology Information (NCBI). The number of expressed sequence tags (ESTs) refers to the number of sequence reads in the transcriptome. Number of sequences refers to the number in the annotated set after assembly and includes both contigs (first value in parentheses) and singlets (second value). Numbers in square brackets denote the rank with respect to all sequences for the species, i.e. [3] was the third most frequently sequenced.
Frequently sequenced in H. littoralis only
Unknown (Populus trichocarpa × Populus deltoides)
Gossypium barbadense chloroplast DNA
MIOX1 (myo-inositol oxygenase); oxidoreductase
Frequently sequenced in R. mangle only
Bg70 (Bruguiera gymnorrhiza)
Beta lactamase (synthetic construct)
LTP4 (lipid transfer protein 4); lipid binding
Phosphorylase family protein (Arabidopsis thaliana)
Mitochondrial transcription termination factor family protein/MTERF family protein
Annotation and classification of sequences into classes
Our annotation approach (see the Materials and Methods section) was based on sequence homology searches and the annotations accompanying them. It aimed to capture the most informative and complete annotation possible. Once completed, we grouped all sequences according to the extent to which they could be reconciled with sequences in public databases (i.e. based on what was ‘known’ about a given sequence). All sequences which could be assigned functional interpretations were categorized as known-knowns (reconciliation class R1). Overall, 23 843 R. mangle and 30 594 H. littoralis sequences were assigned to this class (approx. 33%). Second, sequences that had been reported in other species, but designated as unknown, hypothetical, unnamed, predicted or carrying a clone number without further information, were categorized as known-unknowns (class R2). The R2 class comprises 8441 R. mangle and 9629 H. littoralis sequences (approx. 12%). Finally, all sequences that did not show any similarity to other sequences in GenBank within our e-value criteria were identified as unknown-unknowns (class R3). Nearly 55% of the sequences (35 091 in R. mangle and 56 766 in H. littoralis) fell into class R3.
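The three reconciliation classes can be expressed as a simple rule on the best annotation returned for a sequence. The sketch below is an assumption-laden illustration rather than the authors' implementation: the keyword list is taken from the description above, clone-number-only annotations are not detected, and the function names are hypothetical.

```python
from typing import Dict, Optional

# Terms that mark a match as uninformative (class R2), taken from the text.
UNINFORMATIVE = ("unknown", "hypothetical", "unnamed", "predicted")


def reconciliation_class(description: Optional[str]) -> str:
    """Assign R1 (known-known), R2 (known-unknown) or R3 (unknown-unknown).

    description is the best-hit annotation, or None when no search
    produced a hit within the e-value criteria.
    """
    if description is None:
        return "R3"
    text = description.lower()
    if any(term in text for term in UNINFORMATIVE):
        return "R2"
    return "R1"


def class_shares(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert per-class sequence counts into percentages of the total."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


# Counts reported in the text for R. mangle.
r_mangle = {"R1": 23843, "R2": 8441, "R3": 35091}
r_mangle_shares = class_shares(r_mangle)
```

The R. mangle counts sum to 67 375, the number of sequences carried forward for annotation, so the three classes partition the annotated set.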
Of the sequences identified to classes R1 and R2 based on the NCBI nonredundant (nr) plant database, c. 80% were annotated based on a GenBank nucleotide or protein annotation for A. thaliana (Table 3). Additional gene models were assigned based on the NCBI RefSeq genome database: for R. mangle, 13 049 distinct gene models were found for 26 928 sequences, and for H. littoralis, 13 598 gene models were found for 31 284 sequences. The remaining sequences with R1 and R2 annotations, 5356 for R. mangle and 8939 for H. littoralis, did not have a match with any gene models in reference genomes.
Table 3. Summary of the annotation sources for mangrove sequences in groups R1 (known knowns) and R2 (known unknowns)
Values are per cent of total annotations. EST, expressed sequence tag.
For the R. mangle class R2 sequences, 20% matched EST sequences from B. gymnorrhiza. Thirty-four per cent of the H. littoralis class R2 sequences shared similarity with Gossypium ESTs; both Gossypium and Heritiera are in the Malvaceae. In neither case did these sequences share homology with Arabidopsis within our annotation threshold parameters in blast searches. Interestingly, fewer than 1% showed similarity to sequences from nonplant sources. That the contig pool was free from substantial contamination by small RNAs (e.g. tRNAs, rRNA, plastid RNAs, and possible prokaryotic contaminating RNAs) was also verified by the annotation protocol; such sequences were represented in the overall data set at < 0.003%.
The GO classification system allows descriptions of gene products in terms of their associated biological processes, cellular components and molecular functions. Currently, GO functional interpretations for plants are entirely based on A. thaliana. Therefore, even if a mangrove sequence shares functional similarity with a known plant sequence, it is excluded from the GO functional assignments unless that sequence is from Arabidopsis. In addition, there are sequences (2747 for R. mangle and 6587 for H. littoralis) that were associated with a function during annotation, but which are not assignable to any GO category (e.g. B-type cyclin (Nicotiana tabacum), GI 849074). Overall, 22 596 R. mangle and 26 034 H. littoralis sequences could be assigned to GO categories. Just under half of those (10 114 R. mangle and 11 767 H. littoralis) had an assignment in all three major GO categories. Sequences to which GO categories were assigned had the greatest representation in GO ‘Molecular Function’ (Fig. 2). There were twice as many ESTs shared between ‘Molecular Function’ and ‘Biological Process’ as between ‘Cellular Component’ and either of the other two classes.
For each GO lineage, we compared R. mangle and H. littoralis sequences with the genome-wide GO assignments for A. thaliana (http://www.arabidopsis.org/) and Populus trichocarpa (http://genome.jgi-psf.org/cgi-bin/ToGo?species=Poptr1_1). Figure 3 shows the percentage of transcripts/gene models for each species, in the major GO category ‘Biological Process’. The two mangroves display almost identical profiles, but are noticeably different from the model plants. Similar results were obtained with transcript profile comparisons for GO ‘Cellular Component’ and ‘Molecular Function’ (see the Supporting Information, Fig. S1). Both the striking similarity between the mangroves and their considerable difference from the model plants were even more remarkable given the phylogenetic relationships of these species: Populus and R. mangle are grouped in the clade eurosids I, while Arabidopsis and H. littoralis are in the clade eurosids II. Given the environment in which they live, the mangroves also display a notable lack of representation in the response to stress and response to stimuli categories. In the case of R. mangle in particular, the extensive sampling from a wide variety of field conditions should have assured that these transcripts, if present, would be well represented.
The KEGG orthology (KO) is a classification system that provides an alternative functional annotation of genes by their associated biological pathways. The KO annotations for the mangroves were based on sequence similarity searches to reference sequence genomes (RefSeq) at NCBI. Overall, 2246 R. mangle and 2590 H. littoralis sequences were assigned to KOs, of which only 397 R. mangle and 468 H. littoralis sequences also had all three GO classes assigned.
Figure 4 compares the mangrove KO annotations with Populus genome KO annotations (http://genome.jgi-psf.org/cgi-bin/metapathways?db=Poptr1_1). P. trichocarpa provides the second dicot genome to be completely sequenced and is annotated almost to completion. In almost all KO pathways examined, the two mangroves had similar representation in the number of distinct annotations within each subpathway, sometimes with a very different representation from Populus. Those differing by more than a factor of two are labeled in the figure, and many of these (marked with asterisks) are pathways related either to energy metabolism, especially in low O2 environments (e.g. ‘synthesis and degradation of ketone bodies’) or pathways associated with photoprotection and reactive oxygen species (ROS) scavenging (e.g. ‘terpenoid biosynthesis’) or repair to components of the light processing systems (e.g. ‘photosynthesis’).
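The ‘more than a factor of two’ criterion used to label pathways can be written as a symmetric ratio test. The following is a minimal sketch with hypothetical names; it assumes that the compared quantities are non-negative measures of representation (for example, numbers of distinct annotations per subpathway) and that a pathway present in one data set but absent from the other counts as differing.

```python
def differs_by_factor(a: float, b: float, factor: float = 2.0) -> bool:
    """True when a and b differ by more than `factor` in either direction."""
    if a < 0 or b < 0:
        raise ValueError("representation values must be non-negative")
    if a == 0 or b == 0:
        # Absent from one data set only: treated here as a difference.
        return a != b
    ratio = a / b
    return ratio > factor or ratio < 1.0 / factor
```

The test is symmetric, so a pathway is flagged whether it is over-represented or under-represented in the mangroves relative to Populus; a ratio of exactly two is not flagged.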
Mangroves occupy a common, extremely challenging and variable habitat, but they are by no means behaviorally or physiologically homogeneous. In this study, we chose two species with distinctly different life history and physiological strategies for stress tolerance. Rhizophora mangle (Rhizophoraceae), the neotropical red mangrove, is considered a keystone plant species (Eddy & Faud, 1996; Proffitt & Travis, 2005) as well as an ‘ecosystem engineer’ (Crooks, 2002). At Twin Cays and throughout the islands of the Mesoamerican reef, the substrate is peat, derived from mangroves, to a depth of 10 m, without any mineral soil component. The islands were constructed of and by the mangroves over a period of the last 10 000 yr (Wooller et al., 2004). Rhizophora mangle commonly dominates the landscape, often forming near-monocultures.
Physiologically, R. mangle is characterized as a nonsecreting, salt-including halophyte. It also shows vivipary, with the fertilized ovary growing directly into a seedling (its propagule) while remaining attached to the mother plant. One of its major defense strategies against pathogens, herbivores, UV and oxidative stresses is based on the remarkable accumulation of phenylpropanoids, particularly proanthocyanidins that constitute up to 25% of leaf dry weight (Kandil et al., 2004). By contrast, H. littoralis (Malvaceae), the Indo-West Pacific looking-glass mangrove, is a eudicot (Chase, 2003). While it is capable of growing in full-strength seawater, it is generally found at the terrestrial edge of the mangrove zone, isolated and scattered in the community. In contrast to R. mangle, H. littoralis displays a unique but poorly understood mechanism for extreme salt exclusion from its leaves even in hypersaline substrates.
Our goals in this study were, first, to begin to assemble the most complete representation and annotation of the mangrove transcriptome possible, and second, to use the results to extract characteristics of mangrove transcriptomes by comparing phylogenetically diverse transcript populations that might reflect their evolutionary convergence (i.e. in the sense of Melzer et al. (2008), a mangrove ‘lifestyle’). We see these goals as an important step and contribution toward the broader goal of interweaving ecological and molecular understanding in nonmodel systems, a need which is receiving increased recognition (Reusch & Wood, 2007; Karrenberg & Widmer, 2008).
By pooling RNA samples from different developmental stages, tissue types and microhabitats, we have now increased the publicly available molecular genetic information for mangroves severalfold. Before this, the only other mangrove for which a significant number of ESTs (20 664) was publicly available was B. gymnorrhiza. Smaller libraries have been deposited in public databases for several other species, primarily for Avicennia spp. and Sonneratia spp.
For our first goal, we selected GSFLX pyrosequencing based on cost effectiveness, rapidly improving accuracy, and technological improvement to the point that read lengths are suitable for de novo assembly of sizeable contigs even in the absence of genome-based templates (Cheung et al., 2006; Wicker et al., 2006; Novaes et al., 2008; Vera et al., 2008). The success of this approach is indicated by the fact that, with what amounted to very limited pyrosequencing, we were able to generate, de novo, two annotated mangrove transcriptomes to a combined depth of 536 550 ESTs and 17 066 gene models (Table 1). Using estimates based on model plants (Cheung et al., 2006; Weber et al., 2007), we expect that these cover 50% of the transcribed, polyadenylated portion of the genome.
Comparison of GSFLX and Sanger-type sequences in this study showed that sequence disparities between the two methods are negligible for the purposes of annotation, a conclusion supported by previous studies (Agaton et al., 2002; Gharizadeh et al., 2006). Each mangrove annotation represented, on average, 10 overlapping ESTs, which would compensate for sequencing errors if they did occur (Huse et al., 2007). In fact, the GSFLX sequencing approach may have increased coverage and accuracy over what is possible by Sanger sequencing as it has previously been reported that some genes which are recalcitrant to cloning in conventional EST sequencing posed no problem for GSFLX pyrosequencing (Weber et al., 2007).
Central to downstream uses of the assembled transcriptome is the annotation process. Our approach was designed to capture the most complete and informative annotation possible. This resulted in the three groupings based on their reconciliation with sequences deposited in public databases. Critical to our success here was the decision to occasionally reject a higher alignment score or a lower e-value, if it allowed us to replace uninformative descriptions, such as ‘hypothetical protein’, with more functional annotations. This led to 33% of the transcripts being successfully placed in class R1. Using the same approach, we have annotated the c. 24 000 Bruguiera ESTs in GenBank, successfully converting around 50% of the annotations that had previously been identified by clone ID alone to R1 status. In addition, we were able to annotate 88% of all other publicly available mangrove ESTs to class R1 or R2. Nevertheless, there remained sequences that, despite the sequential blast protocol, produced no annotation (R3). Clearly, this is not a case limited to mangroves. A single FLX run in Zea mays, for example, revealed over 9000 maize-specific ‘orphan genes’ (maize R3), many of which had not previously been detected with conventional EST libraries (Emrich et al., 2007). Given that mangroves display numerous structural and physiological adaptations to an extreme environment, the R3 sequences of mangroves, amounting to 55% of the transcriptome, may play a key role in understanding the different physiological strategies utilized by different mangrove species.
Our emphasis on capturing all possible expressed paralogs carries with it a potential error (i.e. that a single gene would be counted multiple times when it is represented as nonoverlapping contigs). While this was a more serious problem in earlier models of the 454 GS sequencers, it has been minimized in GSFLX by improving contig length, and in our study > 12 000 contigs were longer than 500 bp.
With the specific goal of finding examples of this error, we inspected 10 individual annotation groups for the CAT, DFR, CHS, PIP, TIP, SOS, PAL, NHX, 4CL and AHA gene families. In all cases, the potential error was contraindicated (i.e. multiple contigs overlapped and presented as unique transcripts). Figure S2 in the Supplementary Information, for example, shows a region of Arabidopsis catalase 2 (AtCAT2) aligned with the six H. littoralis homologs that share identity ranging from 72 to 90%. The Arabidopsis genome has three catalase genes, including four splice variants, for a total of seven gene models, while Populus has four genes with five gene models. In the mangrove, in the absence of a fully sequenced genome, we do not have sufficient information to distinguish between splice variants. Thus, as Fig. S2 highlights, the six unique catalase transcripts in H. littoralis represent the minimum number of actual gene models.
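The logic of this check can be sketched in code, assuming each contig in an annotation group has been aligned to a common reference sequence and reduced to a start and end coordinate. The function names and the interval representation are hypothetical and are not taken from the study. Contig pairs that do not overlap on the reference could be fragments of one gene. Contigs that do overlap were not merged by the assembler, so here they are treated as distinct transcripts.

```python
from typing import Dict, List, Tuple

# (start, end) of a contig's alignment on the reference, 1-based, inclusive.
Interval = Tuple[int, int]


def overlap_length(a: Interval, b: Interval) -> int:
    """Number of reference positions shared by two aligned contigs."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def non_overlapping_pairs(contigs: Dict[str, Interval],
                          min_overlap: int = 1) -> List[Tuple[str, str]]:
    """Contig pairs that might be fragments of a single gene."""
    names = sorted(contigs)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if overlap_length(contigs[first], contigs[second]) < min_overlap:
                pairs.append((first, second))
    return pairs


def min_distinct_transcripts(contigs: Dict[str, Interval]) -> int:
    """Largest number of contigs covering one reference position.

    If overlapping contigs are distinct sequences, this depth is a lower
    bound on the number of transcripts in the group.
    """
    events = []
    for start, end in contigs.values():
        events.append((start, 1))      # contig opens at its start
        events.append((end + 1, -1))   # and closes just past its end
    depth = best = 0
    for _, delta in sorted(events):    # closings sort before openings at a tie
        depth += delta
        best = max(best, depth)
    return best
```

An empty result from `non_overlapping_pairs` corresponds to the outcome reported above, in which every contig in a group overlapped the others. The depth returned by `min_distinct_transcripts` matches the idea that the counted transcripts are a minimum number of gene models.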
Is there a genetic basis for a ‘mangrove lifestyle’?
Our second goal was to mine the transcriptome database for indications of an essentially mangrove-specific transcript complement that might reflect their evolutionary convergence. Such convergence implies adaptive changes by which distantly related entities come to appear more related than they are (Doolittle, 1994). Convergent evolution has occurred at all levels of biological organization, including morphological structures, proteins, gene families, organelle genomes and regulatory gene circuits (Conant & Wagner, 2003; Stiller et al., 2003). In mangroves, phenotypic convergence appears not to be easily discerned at the gene sequence level, but must instead be recognized at higher levels of organization.
Parani et al. (1998) approached this question earlier, using random amplified polymorphic DNA (RAPD) and restriction fragment length polymorphism (RFLP) markers to consider the evolution of mangroves from terrestrial species. They examined 11 species of true mangroves, three species classified as ‘minor mangroves’ (defined as usually only having limited representation in the community), seven mangrove associates, mangrove parasites and terrestrial salt-tolerant species, and Solanum esculentum (as an outgroup). They generated a dendrogram depicting genomic level relationships between the species that, in some cases, suggested relationships far different from those based on systematic classifications. They concluded that mangroves, in evolving multiple times from different lineages, had converged on significant genetic homogeneity underlying physiological strategies critical to survival in the intertidal.
Our results suggest that gene expression convergence has occurred with respect to the number of transcripts in all of the major categories recognized in GO classifications (i.e. molecular functions, biological processes and cellular localization; Figs 3, S1) as well as those associated with specific metabolic pathways (KO, Fig. 4). With respect to the number of transcripts linked to each GO lineage or KO biological pathway, in all cases, the two mangroves demonstrated remarkable similarities to each other, and fundamental differences to the model species (which also differed strongly from each other).
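One way to make such profile comparisons quantitative is sketched below. This is an illustration only: the study compares the profiles shown in the figures, and the distance measure used here (half the summed absolute difference between percentage profiles) is an assumption rather than the authors' method. The function names are hypothetical, and the inputs are expected to be counts of annotated transcripts per GO or KO category for each species.

```python
from itertools import combinations
from typing import Dict, Tuple


def to_percentages(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert per-category transcript counts to percentages of the total,
    so that species with different transcriptome sizes are comparable."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("profile contains no annotated transcripts")
    return {category: 100.0 * n / total for category, n in counts.items()}


def profile_distance(a: Dict[str, int], b: Dict[str, int]) -> float:
    """Half the summed absolute percentage difference across categories.

    0 means identical profiles; 100 means no category is shared.
    """
    pa, pb = to_percentages(a), to_percentages(b)
    categories = set(pa) | set(pb)
    return sum(abs(pa.get(c, 0.0) - pb.get(c, 0.0)) for c in categories) / 2.0


def pairwise_distances(
        profiles: Dict[str, Dict[str, int]]) -> Dict[Tuple[str, str], float]:
    """Distance for every pair of species, keyed by sorted species names."""
    return {
        (x, y): profile_distance(profiles[x], profiles[y])
        for x, y in combinations(sorted(profiles), 2)
    }
```

Working in percentages rather than raw counts keeps a difference in the number of sequenced ESTs from driving the comparison. Under this sketch, the pattern described above would appear as a small distance between the two mangroves and larger distances in every pair involving a model species.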
Clearly, on first discovering patterns as apparent as these, it is essential to consider potential artifacts. First, for example, the remarkable similarity of mangrove transcriptome profiles might reflect their phylogeny. However, each of the mangroves is more closely related to one of the model species than it is to the other mangrove; Populus and R. mangle are grouped in the clade eurosids I, while Arabidopsis and H. littoralis are in the clade eurosids II (Soltis et al., 2000). Moreover, the physiological strategies employed by the mangroves to thrive in their extreme environment differ substantially.
Second, the similarities might reflect the similarity of the samplings used in constructing the cDNA libraries. However, more than 60 different tissue type and growth condition combinations, from field and glasshouse, were included for R. mangle, compared with eight glasshouse tissues sampled for H. littoralis. On that basis, we would have no reason to expect more similarity between the mangroves than between one of them and one of the models.
Third, the similarity might simply reflect an equal number of sequences being generated for the two mangroves. However, H. littoralis was represented by 30% more ESTs than R. mangle (Table 1) because of an additional partial GSFLX run during the optimization phase.
Fourth, the similarity of the results for the GO and KO analyses could reflect the dependence of one set of category assignments on the other. However, the functional assignments to GO lineages and KO categories were made independently, and only 18% of the transcripts were assigned both a GO and a KO annotation. Moreover, within the GO lineages, only half of the sequences had assignments in all three categories. Thus, the data sets represented in Figs 3, 4 and S1 were substantially different. Nevertheless, all three GO profiles and the KO profiles support the notion that there is a common pattern in mangroves which is not apparent in the model plants. In addition, although the mangrove transcriptomes are most likely < 50% explored and the collection includes a substantial number of R3 sequences, it is already clear that in certain KO categories (Fig. 4), their gene percentages are greater than in the model plants whose genomes have been sequenced. Considering the nature of the divergent transcripts, these are not simply scattered instances but rather they encode genes whose presence would be expected based on the specific needs of plants growing in extreme environments.
Finally, the contrasts could reflect the fact that model plant genomes were compared with mangrove transcriptomes. However, all of the GO and KO annotations were made by comparison with gene models. In the case of GO, in particular, this was unavoidable: Arabidopsis is the only plant species for which independent GO assignments have been made. In addition, the mangrove transcriptomes were sampled to represent multiple conditions as much as possible, in order to give a comparable set of gene models that imitate the gene representation of their genomes.
Therefore, we conclude that the unusual similarities observed in the two mangrove transcriptome profiles suggest a unique mangrove genomic lifestyle overarching the effects of transcriptome size, habitat, tissue type, developmental stage, and the biogeographic and phylogenetic differences that exist between the two species. This strongly favors convergent evolution playing a role at the transcriptome level in two diverse species that evolved separately to fit a common habitat.
The authors are indebted to the Smithsonian Caribbean Coral Reef Ecosystem project, and especially to Klaus Ruetzler, Candy Feller, Mike Carpenter and all the Carrie Bow Cay station managers for continued support for the field work and access to Twin Cays. The authors also thank the Vice Chancellor for Research at the University of Illinois at Urbana-Champaign for funding the sequencing itself, Indu Rupassara for H. littoralis seed collection, Bette Chapman for invaluable assistance in field sampling, Michelle Harris and Kaley Major for assistance in extracting more than 500 RNA samples, Suniti Karunatillake for programming assistance, and Shahjahan Ali, Jyothi Thimmapuram and Deepika Vulaganthi in the Keck Center for Comparative and Functional Genomics at UIUC for sequencing and assembly. This is contribution number 860 of the Caribbean Coral Reef Ecosystems Program (CCRE), Smithsonian Institution, supported in part by the Hunterdon Oceanographic Research Endowment.