To predict protein families forming supramolecular assemblies potentially analogous to the BMC shell, a search was conducted for families with a similar genomic signature to microcompartment shell proteins. Families were defined using the InterPro domain database32 and filtered in a multistep procedure (Fig. 2). Operons were predicted across all archaeal and bacterial sequences from 593 prokaryotic organisms and searched for the desired genomic arrangement of multiple paralogous proteins occurring together [Fig. 1(C)]. Additionally, protein families were filtered for an average length of less than 200 amino acids to obtain proteins that might constitute relatively compact building blocks in larger structures; most microcompartment shell proteins are ∼100 amino acids in length. Families were subsequently filtered to select those with tandem occurrences within an operon in three or more organisms, and with at least three copies in at least one of those organisms, to arrive at a set of families whose analogy to BMC proteins was significant over multiple organisms. Finally, families were ranked according to the average number of paralogous co-occurrences seen in each operon (Supporting Information), and the top 35 were arbitrarily selected for further analysis (Table I).
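The filtering and ranking steps described above can be sketched in a few lines of Python. This is a minimal illustration and not the pipeline actually used: the input format (one tuple per protein giving its InterPro family, organism, predicted operon, and length), the function name `rank_families`, and its parameter names are all hypothetical. Two interpretations are also assumed here: that the "three copies" criterion refers to copies within a single operon, and that the ranking score is the mean number of family members per operon containing the family. The exact definitions are given in the Supporting Information.

```python
from collections import defaultdict
from statistics import mean


def rank_families(hits, max_avg_len=200, min_organisms=3, min_copies=3, top_n=35):
    """Filter and rank families from (family, organism, operon_id, length_aa) tuples.

    Hypothetical sketch; thresholds follow the text, data layout is assumed.
    """
    lengths = defaultdict(list)
    # family -> {(organism, operon_id): number of family members in that operon}
    copies = defaultdict(lambda: defaultdict(int))
    for family, organism, operon_id, length_aa in hits:
        lengths[family].append(length_aa)
        copies[family][(organism, operon_id)] += 1

    ranked = []
    for family, per_operon in copies.items():
        # Step 1: keep families averaging fewer than 200 amino acids.
        if mean(lengths[family]) >= max_avg_len:
            continue
        # Step 2: tandem occurrence (two or more members in one operon)
        # must be seen in at least three organisms.
        tandem_orgs = {org for (org, _), n in per_operon.items() if n >= 2}
        if len(tandem_orgs) < min_organisms:
            continue
        # Step 3 (assumed reading): at least three copies in one operon
        # in at least one of those organisms.
        if not any(n >= min_copies and org in tandem_orgs
                   for (org, _), n in per_operon.items()):
            continue
        # Step 4 (assumed score): mean family members per operon.
        ranked.append((family, mean(list(per_operon.values()))))

    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked[:top_n]
```

Because singleton operons are included in the mean, a family that usually occurs alone scores close to 1 under this assumed definition, even when it passes the tandem-occurrence filters.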
Of these 35 families, 16 have some functional annotation. Of these, only six (IPR008846, IPR000290, IPR003369, IPR006817, IPR003731, and IPR008894) have no evidence in the available literature for supramolecular assembly formation. We cannot rule out the possibility that some of these might actually be involved in forming large assemblies that have not yet been discovered, but at present they constitute false positives in the analysis. As our search criteria were aimed at sensitivity above specificity, it is not unexpected that a number of protein families not involved in supramolecular assemblies would be obtained. The false positives are distributed toward the lower end of the spectrum in terms of average numbers of paralogs per operon: 1.89, 1.32, 1.30, 1.20, 1.19, and 1.17 versus a mean of 1.78 for the 35 families identified by the search criteria. The other 10 of the 16 characterized families are indeed involved in supramolecular assemblies. This appears to represent a strong enrichment for structural proteins over what would be expected at random, although no attempt was made here to establish how low a number would be expected for a randomly selected set of protein families.
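As a quick check on the numbers quoted above, the following sketch recomputes the mean score of the six annotated families lacking evidence for assembly and the fraction of annotated families that do form assemblies. All values are taken from the text; the variable names are illustrative only.

```python
from statistics import mean

# Average paralogs per operon for the six annotated families
# with no literature evidence of supramolecular assembly.
no_assembly_scores = [1.89, 1.32, 1.30, 1.20, 1.19, 1.17]
overall_mean = 1.78  # mean over all 35 selected families, as stated in the text

no_assembly_mean = mean(no_assembly_scores)
below_overall = sum(score < overall_mean for score in no_assembly_scores)

# Fraction of annotated families with evidence of assembly.
assembly_fraction = 10 / 16

print(f"mean score without assembly evidence: {no_assembly_mean:.3f}")
print(f"families below the overall mean: {below_overall} of {len(no_assembly_scores)}")
print(f"annotated families forming assemblies: {assembly_fraction:.1%}")
```

Five of the six values fall below the overall mean of 1.78; only the family scoring 1.89 lies above it.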
Analysis of computationally identified protein families
Consistent with the design of this study, the annotated family with the highest score (an average of 2.56 paralogs per operon) was the BMC protein family discussed earlier (IPR000249). Of the 538 members of this family identified (see Supporting Information), only 13% (70 proteins) were the sole paralogs in their operon, of which only two cases were the only copies in their organism. Almost two thirds (65%; 348 proteins) occurred in operons with at least two other BMC proteins. There are multiple cases with many BMC paralogs per operon: 13 operons have six paralogs, two operons have seven paralogs, and one operon has nine BMC paralogs. In addition, it was surprising to find that the search identified another protein family involved in BMCs. This protein family (IPR004992), which includes proteins known as CcmL, EutN, and CsoS4 in different bacteria, has been suggested to contribute to the vertices of the icosahedral or near-icosahedral microcompartment shell.4 The tendency of this shell protein to appear as multiple paralogs in an operon has hitherto been underappreciated.
The major structural protein from gas vesicles, which are large proteinaceous shells used for buoyancy in prokaryotes, was also identified in our study. The GvpA family (IPR000638) frequently occurs with multiple copies per operon, with 57 of the 96 proteins identified co-occurring together in operons. Of the remaining 39 single occurrences, only 19 are the sole copy in their genome. A small subset of organisms encoding gas vesicles have larger, more complex operon structures. For example, there are a total of 11 GvpA domain proteins in Rhodococcus sp. RHA1 split over seven operons, while Streptomyces avermitilis MA-4680 has three operons encoding three GvpA homologs each. Closer inspection reveals that the majority of cases are split over two proximal operons transcribed in opposite directions, increasing the number of proximal paralogs [Fig. 3(A)]. The structural basis for assembly of GvpA proteins into gas vesicles is not yet understood in detail.
The small heat-shock protein (Hsp) family identified, which includes the alpha-crystallins (IPR002068), is understood to form large assemblies of multiple, different paralogs.40 This family is more widespread (found in 434 organisms) than any others listed in Table I. Although ∼75% (568 of 742) of the members of this family do not co-occur with other paralogous copies in an operon, this nevertheless leaves 150 cases of co-occurrence with another paralog and 24 proteins in operons containing three paralogs.
The phycobiliproteins, represented by the IPR012128 domain, are components of phycobilisomes, which are light-harvesting supramolecular assemblies in cyanobacteria. Biliproteins form dimers of alpha and beta subunits, which assemble further to form large rod-like components of phycobilisomes and subcellular structures that funnel light energy via bound chromophores to a central core.36, 42, 43 Diverse biliproteins appear to be evolutionarily related; allophycocyanin and phycocyanin were extracted by our study. A total of 28 organisms, all cyanobacteria, contain this protein family, the majority of which (122 proteins of 171) occur together with another paralog.
Finally, our study identifies five domains that are portions of type IV pilins and related proteins: IPR012902, IPR001120, IPR012495, IPR002416, and IPR007047. Type IV pili are long rod-like or filamentous assemblies.44 They are widespread among prokaryotes, where they serve varied roles, especially related to host cell interactions.45 The N-terminal segments of pilin proteins are characterized by two somewhat degenerate methylation consensus sequences, the N-terminal prepilin-type cleavage/methylation motif (IPR012902), and the prokaryotic N-terminal methylation motif (IPR001120). The N-terminal prepilin-type cleavage/methylation site occurs in 2644 proteins split over 380 organisms. Some 205 cases occur with four or more other copies in the same operon. In addition to these widespread sequence motifs, specific pilin-like proteins are also detected here. The IPR002416 domain (which includes the IPR001120 motif noted above as a subdomain) describes the GspH pseudopilin family. Finally, the IPR007047 and IPR012495 domains represent two pilin-related subunits from the tad locus involved in tight, nonspecific adhesion of pathogenic bacteria to surfaces. In both cases, while the proteins occur as single copies the majority of the time, there are nevertheless numerous instances where two or three paralogs occur together.
Of the 35 protein families identified, 19 are completely uncharacterized. These families (IPR006728, IPR007966, IPR014994, IPR009482, IPR012655, IPR012452, IPR010738, IPR010665, IPR009881, IPR007670, IPR012661, IPR011747, IPR010351, IPR009333, IPR008316, IPR007166, IPR010310, IPR012903, and IPR010385) are distributed over a variety of different organisms. Based on an extrapolation from the characterized proteins identified, this set of uncharacterized protein families is likely to be highly enriched in proteins involved in forming novel supramolecular assemblies. We sought to investigate the capacity for self-assembly by one of these uncharacterized families. The family described by the IPR009482 domain was arbitrarily chosen for closer scrutiny. Proteins belonging to this family are found in five hyperthermophilic archaea from our search set, Archaeoglobus fulgidus DSM 4304, Pyrococcus abyssi GE5, Pyrococcus furiosus DSM 3638, Pyrococcus horikoshii OT3, and Pyrococcus kodakarensis KOD1. Figure 3(B) illustrates the arrangement of paralogs from this family in operons. No clear inferences about function could be derived from the other genes encoded in the operons along with these putative structural proteins.
Assembly properties of a previously uncharacterized protein family
To test for the ability of the selected protein family to self-assemble, the four paralogs of the IPR009482 family from Archaeoglobus fulgidus were selected for biophysical characterization and determination of oligomeric state. The genes encoding AF2077, AF2079, AF2080, and AF2081 were cloned, overexpressed in E. coli, and the protein products purified. The four proteins were initially expressed in insoluble form, but after unfolding and refolding from inclusion bodies the target proteins were soluble to varying degrees. AF2079, AF2080, and AF2081 were soluble up to a concentration of 10 mg/mL, while AF2077 remained only marginally soluble. Circular dichroism studies confirmed that all three soluble proteins maintained a similar secondary structure composition, primarily beta sheet [Fig. 4(A)].
Native PAGE analysis revealed that each protein forms multiple distinct higher-order oligomeric states [Fig. 4(B)]. This behavior was reminiscent of some of the BMC proteins from the carboxysome shell studied earlier3 [Fig. 4(C)]. Determining the stoichiometry of assembly was complicated by the ladder of oligomeric states exhibited by each protein in native gels (Fig. 4). Size exclusion chromatography failed to separate individual species fully, but the resultant fractions did show enrichment for different oligomeric states (data not shown). Size-exclusion results indicated that the ladder of oligomeric states ranged from relatively small oligomers to large (>500 kDa) assemblies. Dynamic and static light scattering experiments were consistent with this size range but could not resolve individual species (data not shown). Native PAGE of the size-exclusion fractions showed that they maintained their respective compositions of specific oligomers over a period of at least 6 days, indicating the formation of stable oligomers. To estimate the oligomeric state of individual protein species, one protein, AF2081, was analyzed using a Ferguson plot.46, 47 This allows an extrapolation of native molecular weight based on the change in mobility versus gel concentration (Fig. 5), without the need to purify separate oligomeric species. The band corresponding to one of the dominant AF2081 oligomeric states was calculated as ∼149 kDa, while the faint, fastest running band was calculated to be 29 kDa. These correspond to hexamer (theoretical mass 147.05 kDa) and monomer (theoretical mass 24.51 kDa), respectively. A pentamer or heptamer could also be within the margin of error, but a hexamer gives an excellent fit. Single native gels of AF2079 and AF2080 and the marginally soluble AF2077 revealed a similar pattern of varied oligomeric states [Fig. 4(B)]. The behavior of the proteins from the IPR009482 family is therefore highly suggestive of assembly into high-order structures.
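The logic of a Ferguson analysis can be sketched as follows. For each band, log10 of the relative mobility is regressed against gel concentration, and the negative slope is the retardation coefficient K_R. Standards of known mass then give a calibration from K_R to native molecular weight. The sketch below assumes a log-log calibration, which is one common choice and may differ from the form used for Fig. 5. The function names are hypothetical, and no experimental mobility data are reproduced here; only the masses quoted in the text are used in the usage lines.

```python
import numpy as np


def retardation_coefficient(gel_percent, rel_mobility):
    """Return K_R: the negative slope of log10(Rf) versus gel concentration (%)."""
    slope, _intercept = np.polyfit(gel_percent, np.log10(rel_mobility), 1)
    return -slope


def fit_calibration(standard_mw_kda, standard_kr):
    """Fit log10(K_R) = a * log10(MW) + b from standards (assumed log-log form)."""
    a, b = np.polyfit(np.log10(standard_mw_kda), np.log10(standard_kr), 1)
    return a, b


def estimate_mw(kr, calibration):
    """Invert the calibration to estimate native mass (kDa) from K_R."""
    a, b = calibration
    return 10 ** ((np.log10(kr) - b) / a)


def nearest_oligomer(native_kda, monomer_kda):
    """Return the integer subunit count closest to native / monomer mass."""
    return int(round(native_kda / monomer_kda))


# Masses quoted in the text for AF2081 (monomer theoretical mass 24.51 kDa).
print(nearest_oligomer(149.0, 24.51))  # 6
print(nearest_oligomer(29.0, 24.51))   # 1
```

Rounding to the nearest integer reports only the best-fitting stoichiometry. It does not express the uncertainty noted above, under which a pentamer or heptamer would also be compatible with the estimated mass.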
Further examples
We also searched the literature for other proteins involved in large structures that might have evaded our computational analysis. Sulfur globules provide one case. In sulfur-oxidizing organisms, sulfur is stored in the periplasm for later oxidation in protein-coated sulfur globules.48 In Allochromatium vinosum, the protein coat includes two paralogs, SgpA and SgpB. Although SgpA and SgpB are paralogous, they do not co-occur in the same operon, thus evading detection by the criteria employed. Intriguingly, these short proteins (∼100 amino acids long) are reported to show some similarity to structural proteins such as keratin, silk fibroin, and plant cell wall proteins.49 Polyhydroxybutyrate (PHB) granules provide another case of proteinaceous encapsulation, in this case for energy storage. These granules serve as storage sites for PHB polymers, which are surrounded by an amphiphilic layer of structural proteins.50 The structural proteins include four paralogs of a protein family referred to as phasins, encoded in separate operons.51 The function of phasins is to control the structure of the PHB granules, but the need for multiple paralogs is unclear.