Novel protein domains and motifs in the marine planctomycete Rhodopirellula baltica


*Corresponding author. Tel.: +44-1223-834244; fax: +44-1223-494919, E-mail address:


The planctomycetes are a phylum of bacteria that have unique cell compartmentalisation, yeast-like budding cell division and peptidoglycan-less proteinaceous cell walls. We wished to further our understanding of these unique organisms at the molecular level by searching for conserved amino acid sequence motifs and domains in the proteins encoded by Rhodopirellula baltica. Using BLAST and single-linkage clustering, we have discovered several new protein domains and sequence motifs in this planctomycete. R. baltica has multiple members of the newly discovered GEFGR protein family and the ASPIC C-terminal domain family, whilst most other organisms for which a whole-genome sequence is available have no more than one. Many of the domains and motifs appear to be restricted to the planctomycetes. These protein domains and motifs may have been lost or replaced in other phyla, or they may have undergone multiple duplication events in the planctomycete lineage. One of the novel motifs probably represents a novel N-terminal export signal peptide. Given their unique cell biology, the planctomycetes, and their cell compartmentalisation plan in particular, may require special membrane transport mechanisms. The discovery of these new domains and motifs, many of which are associated with secretion and cell-surface functions, will help to stimulate experimental work and thus further the understanding of this fascinating group of organisms.


The planctomycetes are a phylum of bacteria that have unique cell compartmentalisation, yeast-like budding cell division and peptidoglycan-less proteinaceous cell walls [1]. They share a cell organisation in which the cytoplasm is divided into at least two major compartments separated by an intracytoplasmic membrane, with one compartment completely enclosing the nucleoid DNA. In species related to Pirellula, the major nucleoid-containing compartment is called the pirellulosome, and is associated with a polar asymmetry in the cell plan [2]. Rhodopirellula baltica (formerly known as Pirellula species strain 1 but now reclassified in a new genus) also possesses a pirellulosome compartment [3]. In one planctomycete genus, Gemmata, the nucleoid is enveloped by a double membrane, forming a nuclear body analogous to the eukaryotic nucleus [4,5]. On the basis of rRNA molecular phylogenetics, the planctomycetes are quite distinct, and have been proposed to be one of the deepest branching bacterial phyla, diverging perhaps even earlier than the hyperthermophilic Bacteria [6]. They are thus significant for the study of bacterial evolution both for their unique eukaryote-like cell structure and for their phylogenetic position as a potentially ancient lineage.

Recently, the first complete planctomycete genome sequence became available, that of R. baltica [7]. An unusually high proportion of the genome encodes proteins for which no function could be reliably predicted. This is reflected in the fact that only 37% of R. baltica proteins scored a significant match to any entry in release 12.0 (January 2004) of the Pfam database of protein domain families [8], whilst for most bacterial proteomes coverage is between 55% and 95%. We wished to further our understanding of these unique organisms at the molecular level by searching for conserved amino acid sequence motifs and domains in the proteins encoded by R. baltica.

2. Materials and methods

For each of the 4603 R. baltica proteins with no match to any Pfam entry, we performed a BLAST search [9] against the database of these 4603 proteins using an E-value threshold of 0.01, and on this basis assigned each protein sequence into a cluster. Each cluster containing more than 5 sequences was aligned using MAFFT [10]. These automatically generated alignments were then manually trimmed and edited. From the edited ‘seed’ alignments, hidden Markov models (HMMs) [11] were generated using HMMER. These HMMs were then used to search the Uniprot SwissProt and trEMBL protein sequence databases [12] and to generate ‘full’ alignments of the candidate protein domain families. These alignments were again manually edited and used to generate HMMs for HMMER searches. These steps were reiterated until the family converged on a stable set of members. Member sequences were also investigated using PROSITE [13] to identify matches to previously described sequence features. This approach led to the identification of several novel protein domain families and motifs, which have been entered into the Pfam database [8] under the Accession Numbers listed in Table 1.
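
The clustering step described above can be sketched as follows. This is a minimal illustration rather than the pipeline actually used: it assumes the BLAST results have already been parsed into (query, subject, E-value) tuples, and the function name and input format are hypothetical. Only the E-value threshold of 0.01 and the "more than 5 sequences" cut-off come from the text.

```python
from collections import defaultdict

E_VALUE_THRESHOLD = 0.01  # threshold stated in the text
MIN_CLUSTER_SIZE = 6      # "more than 5 sequences"


def single_linkage_clusters(hits, threshold=E_VALUE_THRESHOLD):
    """Group sequence IDs into clusters by single linkage.

    `hits` is an iterable of (query_id, subject_id, e_value) tuples
    (a hypothetical, pre-parsed input format). Two sequences end up in
    the same cluster if a chain of hits with e_value <= threshold
    connects them.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        # Path halving keeps the union-find trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, subject, e_value in hits:
        root_q, root_s = find(query), find(subject)
        if e_value <= threshold and root_q != root_s:
            parent[root_q] = root_s  # merge the two clusters

    clusters = defaultdict(set)
    for seq_id in list(parent):
        clusters[find(seq_id)].add(seq_id)
    # Largest clusters first.
    return sorted(clusters.values(), key=len, reverse=True)
```

Clusters large enough to align could then be selected with `[c for c in single_linkage_clusters(hits) if len(c) >= MIN_CLUSTER_SIZE]`. Single linkage is permissive: one significant hit is enough to join two groups, which is why the resulting alignments still need manual trimming and editing.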

Table 1. Novel protein families, domains, and motifs discovered in this study

| Family/domain/motif | Pfam Accession No. | Species distribution (number of proteins) |
|---|---|---|
| Solute binding protein family | PF07596 | Rhodopirellula baltica (74) |
| Planctomycete-specific signal peptide motif | PF07595 | Rhodopirellula baltica (29) |
| GEFGR protein family | PF07394 | Rhodopirellula baltica (41), Gloeobacter violaceus (1), Agrobacterium tumefaciens (1), Mesorhizobium loti (1), Bradyrhizobium japonicum (1), Caulobacter crescentus (2), Vibrio vulnificus (3), Vibrio haemolyticus (1), Bordetella bronchiseptica (1), Bordetella parapertussis (1), Bordetella pertussis (1), Ralstonia solanacearum (1), Chromobacterium violaceum (1), Xanthomonas axonopodis (1), Xanthomonas campestris (1) |
| HXXSHH protein family | PF07586 | Rhodopirellula baltica (17) |
| Planctomycete-specific cytochrome C domain 1 | PF07635 | Rhodopirellula baltica (51) |
| Planctomycete-specific cytochrome C domain 2 | PF07583 | Rhodopirellula baltica (42) |
| Planctomycete-specific cytochrome C domain 3 | PF07627 | Rhodopirellula baltica (17) |
| Planctomycete-specific domain 1 | PF07587 | Rhodopirellula baltica (41) |
| Planctomycete-specific domain 2 | PF07624 | Rhodopirellula baltica (15) |
| Planctomycete-specific domain 3 | PF07626 | Rhodopirellula baltica (14) |
| Planctomycete-specific domain 4 | PF07631 | Rhodopirellula baltica (16) |
| Planctomycete-specific domain 5 | PF07637 | Rhodopirellula baltica (15) |
| ASPIC C-terminal domain | PF07593 | Rhodopirellula baltica (13), Homo sapiens (6), Rattus norvegicus (1), Mus musculus (3), Gloeobacter violaceus (1), Streptomyces macromomyceticus (1), Streptomyces kaniharaensis (1), Kitasatospora species (1), Streptomyces ghanaensis (1), Streptomyces citricolor (1), Streptomyces carzinostaticus (1), Streptomyces cavourensis (1), Streptomyces globisporus (1), Streptomyces species (2), Lechevalieria aerocolonigenes (1), Amycolatopsis orientalis (1), Micromonospora megalomicea (1), Micromonospora species (1), Micromonospora echinospora (1), Micromonospora chersina (1) |
| PEGSRP motif | PF07623 | Rhodopirellula baltica (14) |
| VPEP motif | PF07589 | Rhodopirellula baltica (12), Nostoc species (1), Nitrosomonas europaea (4), Chlorobium tepidum (2) |

3. Results and discussion

As a result of single-linkage clustering of R. baltica proteins on the basis of BLAST matches, we identified 15 novel motifs, domains, and protein families with 10 or more members in R. baltica. These are summarised in Table 1 and discussed in the following sections. Multiple sequence alignments for each of these families are accessible from the Pfam database [8].

3.1. A novel solute-binding protein family

We identified a family of 74 R. baltica proteins that did not contain significant matches to any known domains. Most of these 74 proteins are about 300–400 amino acids long, though some have C- or N-terminal extensions that do not match any known domain families (see Fig. 1). One notable exception is RB9435 (Uniprot:Q7ULK7), a 1030 amino acid protein that has a serine/threonine-protein kinase domain (Pfam:PF00069) at the N terminus and a predicted transmembrane region near the centre of the sequence. The majority of the 74 proteins have a predicted signal peptide and/or transmembrane region at the N terminus, suggesting an extracellular location. There is about 13% sequence identity between these R. baltica proteins and the Chlamydophila caviae solute-binding protein CCA00262 (Uniprot:Q823Z1). Solute-binding proteins are components of high-affinity active transport systems, and some also function in the initiation of sensory transduction [14]. Based on the average size of the domain and the sequence similarity (albeit low), we propose that these proteins comprise a novel family of solute-binding proteins, functionally and perhaps structurally similar to those of Solute Binding Protein family 3 (Pfam:PF00497).

Figure 1.

Domain architectures of novel Rhodopirellula baltica putative solute-binding proteins. The predicted signal peptide, predicted by SignalP [31], is indicated by a circle marked ‘S’. Predicted trans-membrane regions are indicated by rectangles labelled ‘TM’. Predicted coiled coil is indicated by the rectangle labelled ‘coil’. Protein domains are indicated by shaded ovals. SB, putative planctomycete-specific solute binding protein domain (PF07596); pkinase, protein kinase domain (PF00069); SB_3, bacterial extracellular solute-binding proteins domain family 3 (PF00497).

The membrane-located RB9435 protein probably transmits a signal into the cytoplasm via the protein kinase domain on binding some unknown substrate. A similar domain architecture to that of RB9435 is also seen in the Streptomyces coelicolor hypothetical protein SCO4911 (Uniprot:Q9EWU9, see Fig. 1).

3.2. A novel signal peptide motif

Perhaps the most interesting sequence feature to come out of this study is a motif found at the N terminus of 34 diverse R. baltica proteins. The motif, which can be represented as RRLxxExLExRxLLA, is preceded by a less conserved ‘leader’ sequence that is very rich in positively charged residues (Fig. 2). Superficially, this motif is reminiscent of the twin arginine transporter (TAT) motif [15] in that it contains two adjacent arginine residues, but there is little similarity beyond this.
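
As an illustration of what the consensus string means, the sketch below turns RRLxxExLExRxLLA into a regular expression (each `x` matches any residue) and looks for it near the N terminus. This is an assumption-laden simplification: the function names and the 100-residue search window are hypothetical, and exact consensus matching is much stricter than the profile HMM deposited in Pfam, so it would be expected to miss divergent family members.

```python
import re

MOTIF = "RRLxxExLExRxLLA"  # consensus as written in the text


def motif_to_regex(motif):
    """Translate a consensus string into a regex; lowercase 'x' is any residue."""
    return re.compile("".join("." if ch == "x" else re.escape(ch) for ch in motif))


def find_n_terminal_motif(sequence, motif=MOTIF, window=100):
    """Return the 0-based start of the first exact consensus match within the
    first `window` residues (window size is an illustrative assumption),
    or None if there is no match."""
    match = motif_to_regex(motif).search(sequence[:window].upper())
    return match.start() if match else None
```

For example, `find_n_terminal_motif("MKKRRRKSRRLAAEQLEDRVLLAGSS")` finds the motif starting at index 8 of this made-up sequence, after a short positively charged leader.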

Figure 2.

Alignment of the N-terminal regions of the Rhodopirellula baltica proteins containing the novel signal peptide motif. For each sequence, the ORF code (RB number), Uniprot Accession Number and sequence coordinates are given in the left-hand column. The alignment was prepared using Jalview [32].

Proteins containing the N-terminal motif range in size from 148 amino acid residues (RB7063) to 8173 residues (RB11769), most being more than 1000 residues long (see Table 2). Although none of these proteins have been studied in the laboratory, we can infer some clues about their structure, function and localisation from their domain content (see Table 2). The overall theme of the domains found in these proteins is that they are largely associated with extracellular and cell-surface functions. For example, eleven of these 34 proteins contain dockerin repeats (e.g., RB6459, RB7321, RB12697) and two contain cohesin domains (RB893 and RB886). The dockerin repeat is the binding partner of the cohesin domain. The cohesin–dockerin interaction is the crucial interaction for complex formation in the cellulosome [16]. Other domains found in these proteins that are expected to be extracellular include an immunoglobulin-like domain (PF05345) in RB6278, a hemolysin-type calcium-binding repeat (PF00353) in RB10149, a bacterial neuraminidase repeat (PF02012) in RB886 and the EF-hand (e.g., RB11133). However, there are no matches to known classes of signal peptide sequences in any of these proteins. Therefore, we propose that this N-terminal motif represents a novel class of signal peptide that marks the protein for export to the cell surface. This motif may be specific to Rhodopirellula, as BLAST searches at The Institute for Genomic Research website failed to demonstrate conservation in the unfinished genome sequence of the planctomycete Gemmata obscuriglobus (data not shown). However, this issue will not be fully resolved until the complete genomes of G. obscuriglobus and other planctomycetes become available.

Table 2. R. baltica proteins containing the novel signal peptide motif

Uniprot Accession Numbers of each protein are given in parentheses. Protein domains recognised by the Pfam and SMART databases [8,30] are listed with their Accession Numbers in parentheses.

| Protein (Uniprot Accession No.) | Length (amino acid residues) | Domains |
|---|---|---|
| RB11769 (Q7UDU8) | 8173 | calx-β (PF03160); Cadherin (PF00028); CA (SM00112); FN3 (SM00060) |
| RB7341 (Q7UNV3) | 7538 | PbH1 (SM00710) |
| RB5524 (Q7URP6) | 7223 | PbH1 (SM00710); ZnMc (SM00235) |
| RB7321 (Q7UNV4) | 6157 | Dockerin (PF00404); PbH1 (SM00710) |
| RB3077 (Q7UUS9) | 6007 | PbH1 (SM00710) |
| RB1924 (Q7UWM5) | 4630 | calx-β (PF03160); Cna_B (PF05738); TSP3 (PF02412); PbH1 (SM00710) |
| RB10149 (Q7UFE9) | 3178 | Hemolysin Ca binding (PF00353); Cadherin (PF00028); CA (SM00112) |
| RB4375 (Q7USQ0) | 3056 | Polymorphic membrane protein (PF02415); calx-β (PF03160); Dockerin (PF00404); PbH1 (SM00710) |
| RB10423 (Q7UF08) | 2079 | Dockerin (PF00404); PKD (SM00089) |
| RB6459 (Q7UQ89) | 2028 | Dockerin (PF00404) |
| RB886 (Q7UY44) | 2009 | BNR (PF02012); Cadherin (PF00028); CA (SM00112); CADG (SM00736); Cohesin (PF00963) |
| RB10666 (Q7UKF9) | 1969 | Domain of unknown function, DUF11 (PF01345) |
| RB12720 (Q7UI70) | 1901 | Lectin C (PF00059); CLECT (SM00034) |
| RB9376 (Q7ULP1) | 1827 | Cadherin (PF00028); immunoglobulin-like (PF05345); CA (SM00112) |
| RB1934 (Q7UWM1) | 1826 | Calcineurin-like phosphoesterase (PF00149); 5′-nucleotidase, C-terminal domain (PF02872); Dockerin (PF00404) |
| RB9330 (Q7ULR8) | 1756 | Cna protein B (PF05738) |
| RB12697 (Q7UI80) | 1703 | Dockerin (PF00404) |
| RB844 (Q7UY66) | 1543 | None |
| RB6278 (Q7UQJ9) | 1541 | Cyclophilin type peptidyl-prolyl cis-trans isomerase (PF00160); immunoglobulin-like (PF05345) |
| RB9053 (Q7UM58) | 1521 | Domain of unknown function, DUF11 (PF01345); Cna protein B (PF05738) |
| RB10413 (Q7UF12) | 1352 | Peptidase S8 (PF00082) |
| RB6025 (Q7UQX3) | 1094 | Dockerin (PF00404) |
| RB2442 (Q7UVU1) | 1050 | Dockerin (PF00404) |
| RB3075 (Q7UUT0) | 888 | Dockerin (PF00404) |
| RB633 (Q7UYG2) | 831 | Dockerin (PF00404); Animal peroxidase (PF03098) |
| RB11131 (Q7UJQ5) | 779 | Dockerin (PF00404); Animal peroxidase (PF03098) |
| RB10159 (Q7UFE7) | 709 | None |
| RB2401 (Q7UVW2) | 706 | Dockerin (PF00404); vanadium chloroperoxidase (PF02328) |
| RB577 (Q7UYI1) | 703 | PbH1 (SM00710) |
| RB11133 (Q7UJQ3) | 635 | Dockerin (PF00404); Peptidase M10 (PF00413); EF hand (PF00036); ZnMc (SM00235) |
| RB893 (Q7UY43) | 419 | Cna protein B (PF05738); Cohesin (PF00963) |
| RB7063 (Q7UPA7) | 148 | None |

Several of these proteins contain calx-β domains [17] (e.g., RB1924), cadherin domains [18] (e.g., RB10149) and thrombospondin type 3 domains [19] (e.g., RB1924). These domains are widespread in eukaryotes, where they are associated with mediating cell-cell interactions. They are also present in some prokaryotic hypothetical protein sequences; for example, according to Pfam release 12.0, there are 630 eukaryotic proteins and 10 bacterial proteins containing calx-β domains, of which six are from R. baltica.

In broth media, planctomycetes related to Pirellula form cell aggregates called rosettes, which may be related to their natural growth habit in aquatic habitats. Also, Pirellula staleyi attaches firmly to glass surfaces [20–22]. Pirellula species form ‘holdfasts’ and ‘multifibrillar fascicles’, which may correlate with such rosettes and with attachment to natural habitat and glass surfaces. Members of the genus Planctomyces produce multifibrillar stalks (Fuerst, 1995), whilst Blastopirellula marina and strain ATCC35122, which is closely related to Pirellula staleyi [3], have been shown to produce pili and flagella [20,23,24]. These structures probably require major extracellular protein export [25]. Although there is no evidence that this novel N-terminal motif is involved in secretion of flagella and pili, proteins such as RB10149, RB11769 and RB1924, which contain the motif as well as calx-β, cadherin and thrombospondin type 3 domains, are good candidates for involvement in these cell behaviours that are characteristic of planctomycetes.

3.3. Other novel families of secreted proteins

We identified two further novel families of R. baltica proteins that are probably secreted. One of these families (Pfam:PF07394) includes 41 R. baltica proteins and several proteins from various members of the Proteobacteria and the cyanobacterium Gloeobacter violaceus. The second family (Pfam:PF07586) consists exclusively of 17 R. baltica proteins. The functions of these proteins are not known; however, many of the PF07394 proteins (e.g., RB7228, RB11746 and RB6101) and at least three of the PF07586 proteins (RB6572, RB3036 and RB11884) have matches to the TAT motif, suggesting that they are exported across the cellular membrane in a prefolded state [15].

3.4. Novel heme-binding and associated domains

In R. baltica, 51 proteins share a novel N-terminal domain that contains the motif characteristic of the heme-binding site of cytochrome C (PROSITE:PS00190), suggesting a function in redox reactions. We refer to this domain as the planctomycete-specific cytochrome C domain 1 (PSC1). Two further probable cytochrome-related domains, PSC2 and PSC3, were found in 42 and 17 R. baltica proteins, respectively. Five further new domains were also found in various combinations in the PSC-containing proteins. We named these novel domains planctomycete-specific domain 1 (PSD1) through to planctomycete-specific domain 5 (PSD5) and deposited them in the Pfam database with the Accession Numbers listed in Table 1.
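
To illustrate the kind of signature involved, the sketch below scans a sequence for the CXXCH motif that is commonly used to describe the heme-binding site of c-type cytochromes. This is a deliberate simplification and an assumption on our part: the full PROSITE PS00190 pattern places additional restrictions on the residues around the cysteines and histidine, so this looser scan will report some sites that PROSITE would reject. The function name is hypothetical.

```python
import re

# Simplified CXXCH signature (an assumption, looser than PROSITE PS00190).
# The lookahead lets overlapping occurrences all be reported.
CXXCH = re.compile(r"(?=(C..CH))")


def heme_binding_sites(sequence):
    """Return the 1-based start positions of every CXXCH match in `sequence`."""
    return [m.start() + 1 for m in CXXCH.finditer(sequence.upper())]
```

Positions are reported 1-based to match the usual convention for protein sequence coordinates.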

The PSC2 and PSD1 domains nearly always occur together (but are separated by a variable inter-domain region of between 35 and 500 amino acids), and can thus be expected to form a functional module. In RB6511, a discoidin domain (Pfam:PF00754) is inserted between the PSC2 and PSD1 domains (Fig. 3). Eukaryotic proteins containing the discoidin domain have previously been implicated in cell adhesion or development [26]. Interestingly, the genes encoding several PSC2/PSD1-containing proteins (RB488, RB2845, RB4233, and RB10051) each appear to fall within operons that also encode ECF sigma factors (RB486, RB2840, RB4241, and RB10049) suggesting a mechanism for their regulation [27].
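
The co-occurrence and inter-domain spacing described above could be tabulated from per-protein domain coordinates along the following lines. This is a sketch under stated assumptions: the `DomainHit` record, the function name and the example coordinates are hypothetical, PSC2 is assumed to lie N-terminal to PSD1, and each protein is assumed to carry at most one copy of each domain (a later hit for the same domain overwrites an earlier one).

```python
from typing import NamedTuple


class DomainHit(NamedTuple):
    """One domain match on one protein (hypothetical record layout)."""
    protein: str
    domain: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def linker_lengths(hits, upstream="PSC2", downstream="PSD1"):
    """Map protein ID -> number of residues between the end of `upstream`
    and the start of `downstream`.

    Proteins lacking either domain, or with the two domains overlapping or
    in the opposite order, are omitted from the result.
    """
    by_protein = {}
    for hit in hits:
        by_protein.setdefault(hit.protein, {})[hit.domain] = hit

    lengths = {}
    for protein, domains in by_protein.items():
        if upstream in domains and downstream in domains:
            gap = domains[downstream].start - domains[upstream].end - 1
            if gap >= 0:
                lengths[protein] = gap
    return lengths
```

With made-up coordinates, a protein whose PSC2 domain ends at residue 200 and whose PSD1 domain starts at residue 251 has a 50-residue inter-domain region. An insertion such as the discoidin domain in RB6511 would show up as an unusually long gap.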

Figure 3.

Example domain architectures of Rhodopirellula baltica proteins containing novel domain families. Protein sequences are identified by genome ORF codes and by Uniprot Accession Numbers. Pfam domains are indicated by coloured ovals. C1, planctomycete specific cytochrome domain 1 (PSC1) (PF07635); C2, PSC2 (PF07583); C3, PSC3 (PF07627); PSD1, planctomycete specific domain 1 (PSD1) (PF07587); PSD2, PSD2 (PF07624); PSD3, PSD3 (PF07626); PSD4, PSD4 (PF07631); PSD5, PSD5 (PF07637); Disc, discoidin domain (PF00754); Ig, immunoglobulin-like domain (PF02368); WD40, WD40 repeat (PF00400). Predicted signal peptides, predicted by SignalP [31], are indicated by yellow circles marked ‘S’. Predicted trans-membrane regions are indicated by rectangles labelled ‘TM’.

3.5. Relationship with chlamydias

It has been suggested that the planctomycetes have a close evolutionary relationship with the chlamydias, e.g., [28], although this is controversial, e.g., [29]. None of the novel planctomycete protein families described here are conserved in the chlamydias, possibly due to the reductive evolution that has led to the small streamlined genomes of these intracellular parasites. However, we have observed one molecular feature shared by both R. baltica and Chlamydia species: the R. baltica proteins RB12436 and RB12437 each consist of an ABC transporter family 3 domain (Pfam:PF00950) fused to a diphtheria toxin repressor protein-like C-terminal DNA-binding domain (Pfam:PF02742). Homologues with the same domain architecture (Fig. 4) are also found in C. pneumoniae (CPN0347), C. muridarum (TC0341) and C. trachomatis (CT069), but are not known in any other organism. This predicted metal-responsive transporter/regulator protein might represent a ‘molecular fossil’ dating back to a common origin of the planctomycetes and chlamydias.
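
A search for a shared domain architecture of this kind can be sketched as an ordered-subsequence test over each protein's list of Pfam accessions. The function names and the input dictionary are hypothetical, and the N-to-C order used in the example (PF00950 followed by PF02742) is an assumption made for illustration, not something read from the figure.

```python
def has_architecture(domains, architecture):
    """True if `architecture` occurs as an ordered subsequence of `domains`.

    `domains` lists a protein's Pfam accessions from N to C terminus.
    Other domains may be interspersed between the matched ones.
    """
    remaining = iter(domains)
    # Each membership test consumes the iterator up to the match,
    # which enforces the order of the accessions.
    return all(accession in remaining for accession in architecture)


def proteins_with_architecture(annotations, architecture):
    """Return the sorted IDs of proteins whose domain order contains `architecture`.

    `annotations` maps protein ID -> list of Pfam accessions (hypothetical input).
    """
    return sorted(
        protein
        for protein, domains in annotations.items()
        if has_architecture(domains, architecture)
    )
```

Run over domain annotations for many genomes, a query such as `proteins_with_architecture(annotations, ["PF00950", "PF02742"])` would list only those proteins carrying both domains in that order, which is how a lineage-restricted fusion can be spotted.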

Figure 4.

A protein domain architecture exclusive to Rhodopirellula baltica and Chlamydia species. Signal peptides, predicted by SignalP [31], are indicated by circles marked ‘S’. Protein domains are indicated by shaded ovals. ABC3, ABC 3 transport family (PF00950); B and D, diphtheria toxin repressor-like metal binding and dimerisation domain (PF02742).


In conclusion, we have discovered several new protein domains and sequence motifs in the planctomycete R. baltica. Many of the proteins involved are predicted to be extracellular. Given their unique cell biology, the planctomycetes, and their cell compartmentalisation plan in particular, may require special membrane transport mechanisms. Relatives of Pirellula possess a single intracytoplasmic membrane dividing the cell into the major nucleoid-containing compartment (pirellulosome) and an outer ‘paryphoplasm’ compartment; therefore, such export might occur across the intracytoplasmic membrane, the cytoplasmic membrane, or both. Since all the ribosomes in these organisms seem to be concentrated in the pirellulosome, any protein export must presumably cross at least the intracytoplasmic membrane. In some cases, such export might conceivably apply to proteins destined to remain within the paryphoplasm rather than to cross the true cytoplasmic membrane, which is closely apposed to the cell wall.

Given the supposed deep branching position of the planctomycetes, it is perhaps not surprising that they should be such a source of molecular diversity. If the deep branching proposition is indeed correct, then it is possible that these protein domains may have been lost or replaced in other phyla, while they may have undergone multiple duplication events in the planctomycete lineage. It is also possible that the planctomycete lineage diverged more recently and has undergone very rapid evolution. We hope that the discovery of these new domains and motifs, associated with secretion and cell-surface functions, will help to stimulate research and enhance further understanding of this fascinating group of organisms.


Research on planctomycetes in JAF's laboratory is supported by the Australian Research Council. DJS and AB are supported by the Medical Research Council (UK) and the Wellcome Trust. We thank Ben Vella Briffa for his contribution to building Pfam family PF07394.

Preliminary sequence data for Gemmata obscuriglobus was obtained from The Institute for Genomic Research website. Sequencing of Gemmata obscuriglobus was accomplished with support from the DOE, and the project is led by Dr. Naomi Ward.