A nonredundant set of 9081 protein crystal structures in the Protein Data Bank was used to examine the solvent content, the number of polypeptide chains, and the oligomeric states of proteins in crystals as a function of crystal symmetry (as classified by crystal systems and space groups). It was found that there is a correlation between solvent content and crystal symmetry. Surprisingly, proteins crystallizing in lower symmetry systems have lower solvent content compared to those crystallizing in higher symmetry systems. Nevertheless, there is no universal correlation between solvent content and preferences of macromolecules to crystallize in certain space groups. Crystal symmetry as a function of oligomeric state was examined, where trimers, tetramers, and hexamers were found to prefer to crystallize in systems where the oligomer symmetry could be incorporated in the crystal symmetry. Our analysis also shows that the frequency distribution within the enantiomorphous pairs of space groups does not differ significantly, in contrast to previous reports.
Water plays an important role in the structure of biomolecules and often influences protein function. Water molecules not only affect protein folding, but also mediate biological processes such as enzymatic reactions and molecular recognition. Information about the fraction of water (solvent) plays a significant role in the X-ray structure determination process. First, knowledge of the solvent content helps to determine the number of molecules in the asymmetric unit (Matthews 1968), which is crucial in early stages of crystal structure determination. Second, an approximate value of solvent content is needed for significant phase improvement by solvent flattening methods (Wang 1985; Leslie 1987; Abrahams and Leslie 1996), which is necessary to resolve the inherent phase ambiguity in single-wavelength anomalous diffraction (SAD) experiments. For both SAD and MAD (multi-wavelength anomalous diffraction) (Hendrickson 1991; Hendrickson et al. 1990), phase improvement by solvent flattening is critical for low resolution data (Kirillova et al. 2007), especially when non-crystallographic symmetry cannot be applied.
Matthews (1968) observed that the solvent content in protein crystals ranged from 27% to 65%, with an average of 43%. He also showed that the quantity VM (the Matthews coefficient, defined as the ratio of the volume of the asymmetric unit to the molecular weight of all protein molecules in the asymmetric unit) is directly related to solvent content: the fraction of the crystal volume occupied by protein is inversely proportional to VM, so solvent content rises as VM increases. Frequency distributions of VM for protein crystals in the Protein Data Bank (PDB) (Berman et al. 2002) have been calculated by various studies (Matthews 1968; Andersson and Hovmöller 2000; Kantardjieff and Rupp 2003). While no correlation has been observed between VM and molecular weight of the protein, a correlation between VM and resolution of the diffraction data was reported, indicating that protein crystals with less solvent tend to diffract better (Kantardjieff and Rupp 2003).
Andersson and Hovmöller (2000) analyzed solvent content in the 12 most frequent space groups in the PDB and observed that the more frequent the space group, the smaller the average solvent content. Therefore it appears that solvent content variation might help answer the long-standing question of why proteins and other chemical compounds prefer to crystallize in a select few space groups.
Kitajgorodskij (1961), who analyzed crystals of organic compounds, pointed out that some space groups are preferred because they allow a greater likelihood of forming closely packed structures. In most cases, the categorization proposed by Kitajgorodskij (1961) successfully explains the relative frequencies of space groups of organic crystals. Some anomalies in this theory of close packing were discussed by Wilson (1993). The distribution of space group frequencies for crystals of organic molecules was also analyzed in detail by Filippini and Gavezzotti (1992), who formulated a theory of “good” and “bad” symmetry operators. Their method estimates the relative importance of symmetry operators by measuring the contribution of each operator to the total packing energy.
To explain the space group distribution of monomeric protein crystals, an entropic model was presented by Wukovitz and Yeates (1995). This model was used for simulation of protein crystal nucleation, and the distribution of simulated nuclei was nearly the same as the distribution of macroscopic crystals (Pellegrini et al. 1997) observed in monomeric proteins. Currently the entropic model and mathematical dimensionality introduced by Wukovitz and Yeates (1995) provide the best explanation for nonuniformity of space group symmetries in protein crystallography, despite the fact that the model does not assume anything about the properties of the molecules. The analysis is more difficult when oligomeric structures are included. In many cases internal symmetry of an oligomer is utilized as part of the crystal symmetry, and as reported recently (Banatao et al. 2006), inducing oligomerization can facilitate crystallization. Regrettably, none of the existing theories attempting to explain the frequency distribution of space groups can describe all of the data. In particular, there is no common theory equally applicable to inorganic, organic, and macromolecular crystals.
Some of these theories have been evaluated by statistical analysis of structural databases. Padmaja et al. (1990) calculated the distribution of the compounds listed in the Cambridge Structural Database (Allen 2002) that crystallized in the 65 chiral space groups (space groups without operators of the second kind). Calculations narrowed to nucleosides, nucleotides, amino acids, and peptides showed that almost 80% of these compounds crystallized in the P212121 (50.3%) and P21 (29.0%) space groups. The next most popular space groups were P1, C2, P21212, P41212, C2221, P41, R3, P43212, P3121, P31, P61, P63, P213, P65, and P43 (in order of decreasing frequency). Padmaja et al. (1990) also analyzed deposits in the PDB that contained at that time 208 nonredundant protein structures (synthetic oligopeptides and viral structures were excluded; only one entry from sets of isomorphous structures was considered) and showed that about half of the deposits belonged to three space groups: P212121 (26.9%), C2 (12.9%), and P21 (10.6%), followed by (in order of decreasing frequency) P3121, C2221, P3221, P41212, P43212, P21212, and P1. Remarkably, a decade and 7000 structures later, the list of most frequent space groups in the PDB had not changed significantly. Andersson and Hovmöller (2000) examined 7384 crystal protein entries in the PDB as of March 1999 (no provisions were made for dealing with redundant structures) and found the three most frequent groups were still P212121 (22.9%), P21 (13.6%), and C2 (8.9%). This study also reported that the frequency distribution of enantiomorphous pairs of space groups differed significantly.
In this work, we reexamine the solvent content distribution of protein crystals, based on a nonredundant subset of the PDB as of February 2007. In particular, we analyze correlations between solvent content and crystal symmetry (taking into account crystal system, cell type, and space group). We also examine whether the solvent content of a protein crystal depends on its oligomeric state or on the number of chains in the asymmetric unit. In addition, we investigate whether the solvent content is correlated with the frequency distribution of specific space groups and Bravais lattices among protein crystal structures in the PDB.
For the nonredundant PDB data set containing 9081 proteins considered in this article, the mean Matthews coefficient was 2.68 (0.78) Å³/Da and the median was 2.48 Å³/Da. This is in agreement with the results of Kantardjieff and Rupp (2003), who reported a mean VM of 2.69 Å³/Da and a median of 2.52 Å³/Da for their set of 10,471 protein crystal structures.
We analyzed correlations between the Matthews coefficient (solvent content) distribution and the crystal system, cell type, and space group. The mean Matthews coefficient varies as a function of crystal system (Table 1; Fig. 1). When the seven crystal systems are treated as a set of equally spaced categories, ordered by increasing degree of symmetry, a linear regression of mean VM weighted by standard deviation is statistically significant (Fig. 2). Thus there is a correlation between the mean VM and symmetry, where higher symmetry correlates with higher mean VM values. As a measure of the degree of symmetry we use the L-value described by Wukovitz and Yeates (1995), where L is the number of independent parameters needed to describe the unit cell (for example, L = 1 for the cubic system and L = 6 for the triclinic system). As the L value decreases, the “degree of symmetry” increases. In the case of the trigonal, tetragonal, and hexagonal systems, which have the same value of L (= 2), the fold of the main rotational axis of each system was used as an additional criterion for arranging the systems. For example, the trigonal system is characterized by a threefold axis and the tetragonal system by a fourfold axis, so the tetragonal system was considered to have a higher degree of symmetry. The mean VM also varies by cell type (Table 1). Primitive and C-centered unit cells have a similar mean VM, while R(H), I- and F-centered cells have higher mean values of VM with the maximum value for F-centered cells.
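The ordering and regression described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and weighting each point by 1/σ² is an assumption, since the text says only that the regression was "weighted by standard deviation". The L values are those quoted from Wukovitz and Yeates (1995); no values from Table 1 are reproduced here.

```python
# Crystal systems mapped to (L, fold). L is the number of independent unit
# cell parameters (Wukovitz and Yeates 1995). The fold of the main rotation
# axis is filled in only for the three L = 2 systems, where it breaks the tie.
CRYSTAL_SYSTEMS = {
    "triclinic": (6, 0),
    "monoclinic": (4, 0),
    "orthorhombic": (3, 0),
    "trigonal": (2, 3),
    "tetragonal": (2, 4),
    "hexagonal": (2, 6),
    "cubic": (1, 0),
}


def symmetry_ranks():
    """Return {system: rank}, rank 1 = lowest symmetry, 7 = highest."""
    ordered = sorted(
        CRYSTAL_SYSTEMS,
        key=lambda name: (-CRYSTAL_SYSTEMS[name][0], CRYSTAL_SYSTEMS[name][1]),
    )
    return {name: rank for rank, name in enumerate(ordered, start=1)}


def weighted_linear_fit(x, y, sigma):
    """Weighted least-squares line y = intercept + slope * x.

    Assumption: each point is weighted by 1 / sigma**2.
    """
    w = [1.0 / s ** 2 for s in sigma]
    sw = sum(w)
    sx = sum(wi * xi for wi, xi in zip(w, x))
    sy = sum(wi * yi for wi, yi in zip(w, y))
    sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    slope = (sw * sxy - sx * sy) / (sw * sxx - sx ** 2)
    intercept = (sy - slope * sx) / sw
    return slope, intercept
```

To use it, one would pass the ranks as `x`, the mean VM per crystal system as `y`, and the corresponding standard deviations as `sigma`; a positive slope corresponds to mean VM increasing with symmetry.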
Table 1. Mean, median, and mode Matthews coefficients and solvent contents of protein structures in the test set as a function of crystal system or cell centering type
For Bravais lattices, the mean VM in the orthorhombic and cubic systems depends on the lattice type (Table 2). In orthorhombic crystals the mean VM is about 10% higher in F than in P lattices. Similar differences are observed between the primitive cubic (cP) and face-centered cubic (cF) lattices. Table 3 lists the space groups with more than 40 representatives in the test set studied. Interestingly, the difference in mean VM between most of those 28 space groups is not larger than 10%, in most cases within a standard deviation. Only space groups I4, I422, I4122, and P6322 have a significantly higher mean VM.
Table 2. Mean, median, and mode Matthews coefficients and solvent contents of protein structures in the test set as a function of Bravais lattice
Table 3. Mean, median, and mode Matthews coefficients and solvent contents of the protein structures in the test set as a function of the 28 most frequent space groups
We analyzed the mean values of VM as a function of protein chains in the asymmetric unit (AU), as shown in Table 4. There is no significant correlation between number of protein chains in the AU and the mean VM. However, structures with one, two, or four molecules in the AU have a slightly lower mean solvent content than others. There is also no observed relationship between mean VM and protein oligomeric state (Table 5).
Table 4. Mean, median, and mode Matthews coefficients and solvent contents in the test set of protein structures, as a function of the number of polypeptide chains per AU
Table 5. Mean, median, and mode Matthews coefficients and solvent contents in the test set of protein structures as a function of protein oligomeric state
Symmetry of protein crystals
Distributions of space groups in the nonredundant set of crystal structures are shown in Table 6 and in supplemental Table S1. The most frequent space group is P212121, assigned to about 21% of the structures, followed by P21 and C2, observed in 15.6% and 11.0% of the structures, respectively. The 28 space groups listed in Table 3 represent 93% of all the crystal structures in the test set.
Table 6. Space group distribution
Results obtained from our nonredundant set of proteins show only very small differences in the distribution of protein crystal structures between enantiomorphous pairs of space groups, contrary to previous reports (Andersson and Hovmöller 2000). Only for the pair of P6122/P6522 is there a significant difference in frequency seen, as P6522 is about 15% more common (Table 6; supplemental Table S1).
Most proteins from the test set crystallized in the orthorhombic and the monoclinic systems (Table 1). More crystals belong to the tetragonal system than the trigonal system, but if we consider crystal families, the hexagonal family (comprising the hexagonal and trigonal systems) is more “popular” than tetragonal, representing almost 20% of all of the crystals in the set. The cubic system is represented by the smallest number of structures, and the triclinic system also has relatively few representatives.
More than 75% of protein structures from the test set crystallized in the primitive lattice. The C-face-centered lattice is the second most common. I-, F-centered, and R(H) lattices are observed in less than 10% of the structures of the test set. The F-centered lattice is very unusual—it is represented by only 50 structures in the test set. The primitive orthorhombic lattice is the most common Bravais lattice type in the test set and is observed in 27% of the protein structures. Primitive monoclinic cells are also very popular. Over 43% of all structures analyzed in this paper have symmetry described by the primitive orthorhombic (oP) and primitive monoclinic (mP) lattices. The primitive tetragonal (tP), monoclinic base-centered (mC), primitive trigonal, and primitive hexagonal (hP) lattices are assigned to 10.5%, 9.4%, 8.7%, and 8.4% of the crystals, respectively.
The relative frequency distribution by crystal system was considered as a function of the oligomeric state of the AU. For the sets of protein structures with monomers or dimers in the AU, the relative frequency distribution by crystal system is similar to that seen for all structures (Fig. 3A). However, for higher-order oligomers, the distributions differ (Fig. 3B). For trimers and hexamers, the hexagonal and trigonal systems are overrepresented (in other words, seen relatively more frequently as compared to the monomer distribution). As should be expected, a relatively large percentage of trimers crystallized in the cubic system as well. For tetramers, the tetragonal and orthorhombic systems are overrepresented, and for octamers, the tetragonal crystal system is the most popular (Fig. 3B). Overall, the largest share of the structures (41%) have one polypeptide chain per asymmetric unit, 34% have two chains per AU, and 12% have four chains per AU. Relatively few structures have five polypeptide chains per AU, almost none have seven, and only about 2% of the protein structures contained more than eight chains per AU. Frequency distribution by oligomeric state for the most frequently observed space groups is shown in Table 7.
Table 7. Frequency distribution by oligomeric state
Removing structures with similar sequences in the PDB to select a nonredundant set of protein structures leads to mean and median values of the Matthews coefficient similar to those obtained by Kantardjieff and Rupp (2003). The similar overall mean and median values of VM suggest that the possible bias introduced by the clustering algorithm used during the selection process (Li et al. 2001) may be negligible.
Our results for frequencies of most popular space groups in the PDB generally agree with previous studies (Padmaja et al. 1990; Wukovitz and Yeates 1995; Pellegrini et al. 1997; Andersson and Hovmöller 2000). Although the number of PDB deposits has increased more than 100-fold since the analysis by Padmaja et al. (1990), and relative frequencies of the space groups have changed, we found that the three most popular space groups (P212121, C2, and P21) still contributed about 50% of our set of nonredundant protein crystal structures.
One of our most interesting observations is the slight but statistically significant overall correlation of mean VM with the crystal system (Figs. 1, 2). In general, higher symmetry is related to higher solvent content. The mean solvent content in cubic crystals is more than 10% higher than in triclinic, monoclinic, and orthorhombic systems. However, this correlation seems to be unrelated to the frequencies of the crystal systems.
Generally, less frequent space groups in the PDB tend to have relatively higher solvent content. However, it is not a universal correlation: The triclinic, monoclinic, and orthorhombic systems (or just the P1, P21, and P212121 space groups), have very similar mean Matthews coefficients, but their frequencies in the PDB are very different. Therefore, the solvent content alone cannot be used to explain the frequency distribution of protein crystal structures in the PDB among different crystal systems, cell types, Bravais lattices, or space groups.
The overall frequency distribution of crystal symmetries cannot be explained solely by the fact that crystal forms of higher symmetry are usually simpler to work with in crystallographic data collection and processing (which might explain why triclinic crystals appear to be underrepresented in the PDB). Close packing and entropic theories also cannot fully explain such phenomena; for example, they cannot account for the differences between the P21 and P1 space groups. Although the entropic model still does not explain some details of the frequencies of space group symmetry, it currently provides the best description of the nonuniformity. This is especially visible when the results presented in Table 3 are reviewed in light of the number of rigid-body degrees of freedom (D) introduced by Wukovitz and Yeates (1995). The space groups for which the D-value is highest (D = 6 or D = 7) occupy the 13 top positions in Table 3, as the frequency of the P6522 (D = 6) space group is the same as R32 (D = 5).
The theory of “good” and “bad” symmetry operators may also be helpful (Filippini and Gavezzotti 1992). If we assume that the hierarchy of symmetry operators established for organic crystals is also valid for protein crystals, the difference in the number of protein crystal structures reported in P21 and P1 may be explained. During nucleation and/or crystal growth, twofold screw axes may contribute a larger part of the packing energy than translation alone, and this contribution may be affected by the shape of the molecule and by the internal symmetry of building blocks. As pointed out by Filippini and Gavezzotti (1992), objects of irregular shape transformed by twofold axes and mirror planes are influenced by bump-to-bump interactions. Inversion centers, screw axes, and glide planes, which generate bump-to-hollow interactions, are better for close packing.
Translation is also assumed to be a “good” symmetry operator: P1 is the sixth most frequent space group (Table 3). Translation is better for formation of close-packed structures than two-, three-, four-, and sixfold axes. This, at first, may not explain the relative frequency of C2, which is the third most often observed space group, after P212121 and P21. However, the model does apply, because the combination of C-centering and a twofold axis yields a twofold screw axis. The results summarized in Table 6 indicate that space groups containing only ordinary crystallographic axes (two-, three-, four-, and sixfold) without screw axes are seldom observed. A similar observation was made by Wilson (1988) for small-molecule crystals. Brock (1996) and Pidcock et al. (2003) also noticed that groups with three-, four-, and sixfold rotation axes occur mainly when molecules or ions are located in special positions on the axes. Based on the space group frequencies observed here, it is possible to classify four- and sixfold screw axes according to their level of frequency, yielding the ordering 41 = 43 > 42 and 61 = 65 > 63 > 62 = 64, respectively. It is interesting that space groups whose symmetry combines a three-, four-, or sixfold screw axis with twofold axes appear much more frequently than equivalent space groups with screw axes alone (though this trend is not seen in the tetragonal crystal system). For example, P6122 is found in more structures than P61. This could be explained by assuming that crystallographic twofold axes are coincident with the twofold axes that form protein oligomers. It was recently pointed out that proteins forming oligomers tend to crystallize better than monomeric proteins (Banatao et al. 2006).
High solvent contents are characteristic for protein crystals. Moreover, protein crystals also tend to have more molecules in the AU than do small molecule crystals. A single polypeptide chain in the AU (Table 4) is observed in 41% of the protein structures, while 92% of small molecule crystals have one (or fewer) formula units per AU (Padmaja et al. 1990). Newer reports (Steiner 2000) show that the relative frequencies of the number of molecules in the AU are strongly dependent on crystal system and space group. Steiner reports that organic crystal structures with more than one molecule in the asymmetric unit make up 10% of all organic crystal structures. For steroids, nucleosides, and nucleotides this value reaches 20%. The last column in Table 4 shows that even numbers of polypeptide chains in the AU are preferred (except for single-chain structures). The analysis also shows that if only structures with two or more chains in the AU are considered, the frequency of structures with an even number of polypeptide chains per AU (two, four, or six) is higher than the frequency of structures with an odd number of polypeptides per AU (three, five, or seven). Structures with five or seven polypeptide chains per AU are especially rare. Surprisingly, more detailed study reveals that for crystal structures described in certain space groups, especially P1, P21, P21212, P43, I4, P3, P31, P32, and R3, two molecules in AU are more often observed than one.
Recently, an interesting approach to protein crystallization by synthetic symmetrization was proposed (Banatao et al. 2006). The distribution of different oligomeric forms in particular space groups (Table 7) and crystal systems (Fig. 3) confirms previous observations showing that internal symmetry of the oligomer could be an advantage during crystallization and such symmetry could be utilized as part of the crystal symmetry. This is observed for the C222, C2221, I222, F222, I4122, and P6322 space groups, where more crystal structures are composed of dimers than of monomers. Similarly, in space groups P63 and P213, trimers are observed more often than other oligomeric states. The relative distributions of monomers and dimers in the crystal systems (Fig. 3A) are very similar, but in the case of higher oligomeric assemblies (Fig. 3B) the distribution is strongly influenced by the fact that the internal symmetry of the oligomer may be adopted by crystal symmetry. The most striking examples of such behavior are trimers in the trigonal system and tetramers in the orthorhombic system. Another example is the relative absence of pentamers and heptamers, which could be attributed to the inability of these assemblies to be incorporated into crystal symmetry.
Andersson and Hovmöller (2000) reported that frequency distributions of enantiomorphous pairs of space groups differ significantly. They reported that P3121 was about twice as common as P3221 and that this was also the case for the pairs of P41212/P43212 and P6122/P6522. However, such a conclusion appears to be an artifact of not using a nonredundant data set. For example, in the PDB as of September 2007, 1602 structures were solved in P43212, versus 1365 structures solved in P41212. However, 143 of the P43212 structures were of a single protein, tetragonal lysozyme. In contrast, our analysis, which took into account only crystals of nonredundant proteins in the PDB, showed only minor differences in frequencies between enantiomorphous pairs of space groups.
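The effect of redundancy on this pair can be illustrated with the counts quoted above. The sketch below is not the authors' analysis: collapsing the lysozyme entries to a single representative is only a rough stand-in for full sequence clustering, and the z-statistic (a normal approximation to a test of equal frequencies within a pair) is an assumed choice of test; the function names are hypothetical.

```python
import math


def pair_ratio(n_a, n_b):
    """Ratio of structure counts for the two members of a pair."""
    return n_a / n_b


def equal_frequency_z(n_a, n_b):
    """Normal-approximation z for the hypothesis that both space groups
    of a pair are equally likely (assumed test, not from the source)."""
    return (n_a - n_b) / math.sqrt(n_a + n_b)


# Counts quoted in the text for the PDB as of September 2007.
P43212_ALL = 1602
P41212_ALL = 1365
LYSOZYME_P43212 = 143

# Keep one lysozyme entry as the representative and drop the other 142.
P43212_COLLAPSED = P43212_ALL - LYSOZYME_P43212 + 1

ratio_raw = pair_ratio(P43212_ALL, P41212_ALL)
ratio_collapsed = pair_ratio(P43212_COLLAPSED, P41212_ALL)
z_raw = equal_frequency_z(P43212_ALL, P41212_ALL)
z_collapsed = equal_frequency_z(P43212_COLLAPSED, P41212_ALL)
```

Removing the redundancy from just one protein already moves the ratio noticeably closer to 1, which is the direction of the argument made in the text; the remaining redundancy from other proteins is not addressed by this toy calculation.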
The distribution of space-group frequencies may be affected by errors in the reported crystal symmetry. As was pointed out for small molecules (Mighell et al. 1983; Baur and Kassner 1992; Herbstein and Marsh 1998; Marsh and Spek 2001; Marsh 2002; Marsh et al. 2002), many structures are reported in the wrong symmetry, and in most cases, this wrong symmetry is unnecessarily low (Baur and Tillmanns 1986). In general, there are two types of mistakes in choosing space groups: an incorrect space group within the correct crystal system, or an incorrect space group together with an incorrect crystal system (Marsh 1995; Marsh and Bernal 1995).
We have shown that proteins crystallizing in lower symmetry systems tend to have a lower solvent content compared to those crystallizing in higher symmetry systems, but the solvent content is not correlated with the distribution of the frequency of crystal structures among different Bravais lattices or specific space groups. We also found that some oligomeric assemblies, like trimers, tetramers, and hexamers, prefer to crystallize in systems where the oligomer symmetry can be incorporated in the crystal symmetry. Moreover, the frequency distributions of enantiomorphous pairs of space groups do not differ significantly. The space group frequencies obtained after analysis of a large nonredundant set of proteins agree surprisingly well with results obtained by Wukovitz and Yeates (1995) and may be treated as an additional validation of their model.
Materials and Methods
A nonredundant set of protein structures determined by X-ray crystallography was obtained based on the PDB as of February 2007, by clustering proteins at 90% sequence identity (Holm and Sander 1998; Li et al. 2001). Viruses and protein–DNA or protein–RNA complexes were excluded, but protein–protein complexes were included. Our final test set contained 9081 structures, where no structure had greater than 90% sequence identity to any other structure in the set. Use of a clustering algorithm (Li et al. 2001) eliminated bias related to redundant structures in the PDB; however, it might have introduced a slight bias of its own. First, chains of 20 amino acids or fewer were discarded. Second, the algorithm used resolution and R-value as indicators of the quality of structures to select cluster representatives. Therefore, the test group contained the structures of the highest-resolution crystals in each cluster, and resolution has been observed to be a significant discriminator of VM (Kantardjieff and Rupp 2003).
The volumes of unit cells were calculated using the unit cell parameters in the CRYST1 records from PDB files. Molecular weights of the polypeptide chains in the asymmetric units were calculated using the sequence information in the SEQRES records and the molecular weights of non-polymerized amino acids. This has limited accuracy as the SEQRES records sometimes give longer sequences than the real sequence of the chains crystallized. Only standard residues were considered: Selenomethionines were treated as methionines. The Matthews coefficient and solvent content were calculated using Matthews' original assumptions (Matthews 1968). Only standard space groups were considered, e.g., structures described in A2, B2, and I2 space groups were considered as C2.
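The calculation described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code. The function names are hypothetical. The CRYST1 column positions follow the fixed-width PDB file format. The constant 1.23 Å³/Da is the usual conversion of Matthews' assumed protein partial specific volume (0.74 cm³/g). The molecular weight of the asymmetric unit and the number of asymmetric units per cell are taken as inputs, and the example cell and weight are invented for demonstration only.

```python
import math

# Volume occupied by 1 Da of protein, in cubic angstroms, assuming a
# partial specific volume of 0.74 cm^3/g (Matthews 1968).
PROTEIN_VOLUME_PER_DALTON = 1.23


def parse_cryst1(line):
    """Return (a, b, c, alpha, beta, gamma, space_group) from a CRYST1 record."""
    if not line.startswith("CRYST1"):
        raise ValueError("not a CRYST1 record")
    a = float(line[6:15])
    b = float(line[15:24])
    c = float(line[24:33])
    alpha = float(line[33:40])
    beta = float(line[40:47])
    gamma = float(line[47:54])
    space_group = line[55:66].strip()
    return a, b, c, alpha, beta, gamma, space_group


def unit_cell_volume(a, b, c, alpha, beta, gamma):
    """General (triclinic) cell volume; edges in angstroms, angles in degrees."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    return a * b * c * math.sqrt(1 - ca * ca - cb * cb - cg * cg + 2 * ca * cb * cg)


def matthews_coefficient(cell_volume, n_asu, mw_asu):
    """VM in cubic angstroms per Da: asymmetric unit volume over protein mass."""
    return cell_volume / (n_asu * mw_asu)


def solvent_fraction(vm):
    """Solvent content as a fraction of the crystal volume."""
    return 1.0 - PROTEIN_VOLUME_PER_DALTON / vm


# Hypothetical example: orthorhombic cell, four asymmetric units per cell,
# 20 kDa of protein in the asymmetric unit.
EXAMPLE = "CRYST1   50.000   60.000   70.000  90.00  90.00  90.00 P 21 21 21    4"
cell = parse_cryst1(EXAMPLE)
volume = unit_cell_volume(*cell[:6])
vm = matthews_coefficient(volume, n_asu=4, mw_asu=20000.0)
solvent = solvent_fraction(vm)
```

The number of asymmetric units per cell is passed explicitly because it depends on the space group; a fuller implementation would look it up from the space group symbol after mapping nonstandard settings (for example A2, B2, and I2) to the standard one.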
The oligomeric states of the proteins were assigned using PITA v0.4.3 (Ponstingl et al. 2003). All oligomeric states with a PITA score higher than 70 were considered to be potentially significant. When there were multiple possible states detected by the PITA program, the oligomeric state with the largest number of oligomers was used if its score was 80% or greater of the highest returned PITA score.
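One possible reading of this selection rule is sketched below. The input format (a list of `(n_subunits, score)` pairs) and the function name are hypothetical and do not reflect the actual output format of PITA. The sketch interprets "largest number of oligomers" as the largest number of subunits, and returning `None` when no state exceeds the threshold is an assumption, since the text does not say how such cases were handled.

```python
def select_oligomeric_state(states, min_score=70.0, fraction=0.8):
    """Pick an oligomeric state from candidate (n_subunits, score) pairs.

    Only states scoring above min_score are considered. Among those, the
    largest assembly whose score is at least `fraction` of the best score
    is returned. Returns None if no state passes the threshold.
    """
    significant = [(n, s) for n, s in states if s > min_score]
    if not significant:
        return None
    best_score = max(s for _, s in significant)
    eligible = [n for n, s in significant if s >= fraction * best_score]
    return max(eligible)
```

For example, with hypothetical candidates `[(1, 90), (2, 100), (4, 85), (8, 75)]` the best score is 100, so the cutoff is 80 and the tetramer is chosen; the octamer passes the significance threshold but falls below the cutoff.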
The authors thank Anna Gawlicka-Chruszcz and Zbigniew Dauter for helpful discussions. The work was supported by NIH grants GM74942 and GM53163.