There is no dispute that the overwhelming majority of the 50,000 structures in the Protein Data Bank are essentially correct (i.e., lacking major experimental error). Nevertheless, there have been and there continue to be reports of protein structures which are seriously in error. A number of these are listed in Table I.

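Table I. X-ray Protein Structures That Required Substantive Revision

| Protein | Year of initial report | Year of revision or retraction | Resolution of initial report (Å) | Resolution of revision (Å) | References |
| --- | --- | --- | --- | --- | --- |
| Ferredoxin from Azotobacter vinelandii | 1982 | 1988 | 2.0 | 3.0 | Ghosh et al. (ref. 1); Stout et al. (ref. 2) |
| Gene V protein | 1983 | 1995 | 2.3 | 1.8 | Brayer and McPherson (ref. 3); Skinner et al. 1994 (ref. 4) |
| 6-Phosphogluconate dehydrogenase | 1983 | 1991 | 2.6 | 2.5 | Adams et al. (ref. 5); Adams et al. (ref. 6) |
| Nucleosome histone core | 1985 | 1994 | 3.3 | 3.1 | Burlingame et al. (ref. 7); Wang et al. (ref. 8) |
| EcoRI endonuclease | 1986 | 1990 | 3.0 | 2.8 | McClarin et al. (ref. 9); Kim et al. (ref. 10) |
| Phosphocarrier protein HPr | 1987 | 1993 | 2.8 | 2.0 | El-Kabbani et al. (ref. 11); Jia et al. (ref. 12) |
| Interleukin-2 | 1987 | 1992 | 3.0 | 2.5 | Brandhuber et al. (ref. 13); Bazan (ref. 14) |
| Rubisco small subunit | 1988 | 1989 | 2.6 | 2.8 | Chapman et al. (ref. 15); Knight et al. (ref. 16) |
| Yeast enolase | 1988 | 1989 | 2.25 | 2.25 | Lebioda and Stec (ref. 17); Lebioda et al. (ref. 18) |
| Glutaminase-asparaginase | 1988 | 1994 | 3.0 | 2.9 | Ammon et al. (ref. 19); Lubkowski et al. (ref. 20) |
| Photoactive yellow protein | 1989 | 1995 | 2.4 | 1.4 | McRee et al. (ref. 21); Borgstahl et al. (ref. 22) |
| Serum albumin | 1992 | 1998 | 2.8 | 2.5 | He and Carter (ref. 23); Curry et al. (ref. 24) |
| MsbA ABC transporter | 2001 | 2006 | 4.5 | | Chang and Roth (ref. 25); Dawson and Locher (ref. 26); Chang et al. (ref. 27) |
| EmrE multidrug transporter | 2005 | 2006 | 3.7 | | Pornillos et al. (ref. 28); Chang et al. (ref. 27) |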

It might be argued that as crystallographers have improved their techniques, and learned by experience, they should become “better” at solving structures. Table I suggests that this is not necessarily the case. Indeed, it is remarkable that all of the earliest protein structures to be determined at atomic resolution (myoglobin, lysozyme, chymotrypsin, carboxypeptidase, ribonuclease) have proven to be essentially correct and have not needed substantive revision. When these structures were first reported, now all over 40 years ago, the protagonists never knew quite what to expect. Also, in most cases the investigators had little if any experience in interpreting protein electron density maps. All of these early structures were determined using fairly high resolution data (∼2.5–2.0 Å) and multiple heavy-atom derivatives (because this is what had worked for myoglobin). Looking back, some of these structures were probably “overdetermined,” in the sense that the basic fold of the protein might have been found using a subset of the original data. The risk of using too few observations became apparent later (Table I).

Not surprisingly, and for good reason, structures found to be in error have prompted the development of checks (i.e., “model validation”) that can be used to try to ensure that similar mistakes do not occur in the future.29

A number of these tests are based on expected stereochemistry. Are bond lengths and angles within normal limits? Are the Ramachandran angles acceptable? Are there nonbonded atoms that are placed too close to each other? At the same time, such tests need to be applied with discretion.30 For example, Hooft et al.31 published a list of “errors” in protein structures in the PDB based on violations of various geometric criteria. They considered all D-amino acids in the Protein Data Bank to be “errors,” but some of these were from protein complexes with cell-wall fragments that contain bona fide D-amino acids. Others were from the antibiotic gramicidin, which also contains D-amino acids.
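As a rough illustration of the kind of geometric quantity such checks rest on, the sketch below computes a torsion angle from four atomic positions; the Ramachandran angles (phi, psi) are backbone torsions of this type. The function name `dihedral_deg` and the sample coordinates are invented for illustration, and the empirical distributions that validation programs compare against are not reproduced here.

```python
import math

def _sub(p, q):
    return (p[0] - q[0], p[1] - q[1], p[2] - q[2])

def _dot(p, q):
    return p[0] * q[0] + p[1] * q[1] + p[2] * q[2]

def _cross(p, q):
    return (p[1] * q[2] - p[2] * q[1],
            p[2] * q[0] - p[0] * q[2],
            p[0] * q[1] - p[1] * q[0])

def dihedral_deg(p0, p1, p2, p3):
    """Torsion angle (degrees, in the range -180..180) about the p1-p2 bond."""
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)          # normal to the plane p0, p1, p2
    n2 = _cross(b2, b3)          # normal to the plane p1, p2, p3
    x = _dot(n1, n2)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    return math.degrees(math.atan2(y, x))

# Hypothetical four-atom arrangement: a right-angle twist about the z axis.
print(dihedral_deg((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

For a real model, the four positions would be the C, N, CA, C atoms (for phi) or N, CA, C, N atoms (for psi) of consecutive residues.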

Other tests of model structures are more crystallographic in nature. Are the crystallographic refinement residuals R and Rfree sufficiently low? Do the B-factors (indicative of mobility) correlate with solvent exposure? Does the placement of the atoms in the model match the local electron density? Yet other tests are based on energetic considerations. Do polar atoms have polar neighbors and, likewise, are nonpolar atoms in nonpolar environments?
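For readers less familiar with the residuals mentioned above, the following is a minimal sketch of how R and Rfree are conventionally defined: the summed absolute difference between observed and scaled calculated structure factor amplitudes, divided by the summed observed amplitudes, with Rfree evaluated over a set of reflections held out of refinement. The function names, the single linear scale factor, and the sample amplitudes are simplifying assumptions, not the procedure of any particular refinement program.

```python
def r_factor(f_obs, f_calc, scale=1.0):
    """R = sum(| |Fo| - k*|Fc| |) / sum(|Fo|) over one set of reflections."""
    if len(f_obs) != len(f_calc) or len(f_obs) == 0:
        raise ValueError("amplitude lists must be non-empty and of equal length")
    numerator = sum(abs(fo - scale * fc) for fo, fc in zip(f_obs, f_calc))
    return numerator / sum(f_obs)

def r_work_and_free(f_obs, f_calc, is_free):
    """Return (R_work, R_free) using one overall scale factor (an assumption)."""
    k = sum(f_obs) / sum(f_calc)
    work = [(fo, fc) for fo, fc, free in zip(f_obs, f_calc, is_free) if not free]
    free = [(fo, fc) for fo, fc, free in zip(f_obs, f_calc, is_free) if free]
    r_work = r_factor([fo for fo, _ in work], [fc for _, fc in work], k)
    r_free = r_factor([fo for fo, _ in free], [fc for _, fc in free], k)
    return r_work, r_free

# Made-up amplitudes; the last reflection is flagged as part of the free set.
observed = [100.0, 200.0, 150.0, 50.0]
calculated = [110.0, 190.0, 150.0, 50.0]
print(r_work_and_free(observed, calculated, [False, False, False, True]))
```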

Notwithstanding some limitations, tests such as these provide an overall way to estimate the quality of structural models in the PDB, and this is the subject of a recent report by Brown and Ramaswamy.32 To use their words, “the most striking result is the association between structure quality and the journal in which the structure was first published. The worst offenders are the apparently high-impact general science journals that include Cell, Science, Molecular Cell and Nature. The rush to publish high-impact work in the competitive atmosphere may have led to the proliferation of poor-quality structures.”

In the current issue of Protein Science, Sheffler and Baker33 introduce a new method to test for the integrity of structure models based on core packing. Their approach highlights specific PDB entries that have been questioned as unreliable. It also points to a type of error that might occur for other structures and would be quite difficult to detect, namely, a mistake in the dimensions of the crystallographic unit cell. In a typical X-ray structure analysis, the dimensions of the unit cell are routinely obtained to high precision during X-ray data collection and are subsequently deposited in the PDB along with the crystallographic coordinates. As noted by Sheffler and Baker, PDB entry 179L corresponds to a mutant of T4 lysozyme, which was inadvertently refined with the a and b cell dimensions 10% too large (entry 177L corresponds to the same mutant refined with the correct unit cell). In the mathematical description of diffraction, the coordinates are in fractional (dimensionless) units, not in Angstroms. If a structure is refined with the wrong unit cell dimensions, the fractional coordinates will remain essentially the same and the calculated structure factor amplitudes will also change little (see the next paragraph). The consequences of the error in the cell dimensions become more apparent when the fractional coordinates are converted into absolute values (i.e., Angstroms). In the case of the lysozyme mutant, a and b were 10% too large. This meant that the refined model was “stretched” by 10% in the directions of the a and b cell edges. This in turn caused unusual cavities in the core, which were recognized by the Sheffler and Baker procedure. In terms of the crystallographic refinement, this “stretching” of the model distorts individual bond lengths and angles, but this was not apparent from the standard checking procedures.
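The stretching effect can be sketched numerically. The example below converts two atoms to fractional coordinates with a "true" cell and back to Angstroms with a cell whose a and b edges are 10% too large. The cell dimensions, atom positions, and helper names are hypothetical, and the cell axes are assumed to be mutually orthogonal purely to keep the arithmetic simple (the actual lysozyme crystal form is not orthogonal).

```python
import math

def cart_to_frac(xyz, cell):
    """Angstrom coordinates to fractional coordinates (orthogonal axes assumed)."""
    return tuple(v / edge for v, edge in zip(xyz, cell))

def frac_to_cart(frac, cell):
    """Fractional coordinates to Angstrom coordinates (orthogonal axes assumed)."""
    return tuple(f * edge for f, edge in zip(frac, cell))

def distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

true_cell = (60.0, 60.0, 95.0)    # hypothetical a, b, c in Angstroms
wrong_cell = (66.0, 66.0, 95.0)   # a and b are 10% too large, c is correct

atom = (10.0, 10.0, 10.0)
along_a = (11.53, 10.0, 10.0)     # 1.53 Angstrom bond parallel to a
along_c = (10.0, 10.0, 11.53)     # 1.53 Angstrom bond parallel to c

def reinterpret(xyz):
    """Keep the fractional coordinates, but express them in the wrong cell."""
    return frac_to_cart(cart_to_frac(xyz, true_cell), wrong_cell)

bond_a = distance(reinterpret(atom), reinterpret(along_a))
bond_c = distance(reinterpret(atom), reinterpret(along_c))
print(f"bond along a: {bond_a:.3f} A, bond along c: {bond_c:.3f} A")
```

In an actual refinement, geometric restraints would resist part of this distortion, so the sketch shows only the direction of the effect, not its final magnitude in a deposited model.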

The insensitivity of the refinement process to substantial errors in the cell constants is at first surprising. If one takes a particular model from the PDB and performs a series of parallel refinements where the cell constants are systematically varied, one finds that the resulting R-values increase rapidly as one moves away from the “true” cell constants. This result, however, does not imply that an error in the cell constants will necessarily be signaled by an increase in the R-factor. If the coordinates of the protein molecule are in Angstroms, as are the entries in the PDB, then a change Δa in one of the cell dimensions will, in effect, cause a rigid-body translation of the protein parallel to that cell edge, with magnitude between zero and ±Δa depending on the location of the molecule in the unit cell. If Δa is sufficiently large (e.g., several Angstroms), the shift in the protein coordinates may be outside the range of convergence of the refinement procedure and the R-value will remain high. If, however, the starting crystallographic coordinates are converted to fractional values before making the change in the cell constant then the R-factor will change very little and the refinement procedure will be very insensitive to the error in the cell edge.
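A toy calculation may make the distinction concrete. In the sketch below, the structure factor is computed from fractional coordinates alone, using identical point scatterers and ignoring the resolution-dependent scattering factors and B-factors (which do depend on the cell). The two-atom "structure", the cell edges, and the function names are all invented for illustration, and only the a axis is considered.

```python
import cmath
import math

def amplitude(frac_coords, hkl):
    """|F(hkl)| for unit point scatterers at the given fractional coordinates."""
    h, k, l = hkl
    total = sum(cmath.exp(2j * math.pi * (h * x + k * y + l * z))
                for x, y, z in frac_coords)
    return abs(total)

a_true, a_wrong = 60.0, 66.0            # hypothetical cell edge and its erroneous value
cart_x = [0.0, 15.0]                    # atom positions along a, in Angstroms
hkl = (3, 0, 0)

frac_true = [(x / a_true, 0.0, 0.0) for x in cart_x]

# Case 1: Angstrom coordinates held fixed, so the fractional coordinates change.
frac_fixed_cart = [(x / a_wrong, 0.0, 0.0) for x in cart_x]

# Case 2: fractional coordinates held fixed, so the Angstrom coordinates change.
frac_fixed_frac = list(frac_true)
shift = frac_true[1][0] * a_wrong - cart_x[1]   # equals x_frac * (a_wrong - a_true)

amp_true = amplitude(frac_true, hkl)
amp_fixed_cart = amplitude(frac_fixed_cart, hkl)
amp_fixed_frac = amplitude(frac_fixed_frac, hkl)

print(f"true cell:            |F| = {amp_true:.3f}")
print(f"fixed Angstroms:      |F| = {amp_fixed_cart:.3f}")
print(f"fixed fractional:     |F| = {amp_fixed_frac:.3f}, atom moved {shift:.2f} A")
```

Under these assumptions the amplitude is unchanged when the fractional coordinates are preserved, which is why the residual gives little warning, whereas holding the Angstrom coordinates fixed alters the amplitude.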

In the case of the lysozyme mutant, the error in the cell dimensions was substantial and was recognized by the cavity calculation. A smaller error would be much harder to identify. As an example, suppose one had a crystal in which all three cell dimensions were too large by 1%. (Such an error might occur, if the crystal-to-detector distance or the X-ray wavelength was specified incorrectly.) The subsequent refinement of the structure would seem normal and the error would be almost impossible to detect by the standard tests. The refined model, however, would be expanded by 1%. Such an error might be detected if the structure was compared with a close homolog but it would presumably require more than a routine calculation of the root-mean-square difference between the two structures. In general, an error in one or other of the cell dimensions of a crystalline protein would be expected to be most apparent in global features of the whole protein such as density, radius of gyration, or overall dimensions.
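The point about global features can be illustrated with a small sketch, assuming a uniform 1% expansion about the molecular centroid. The radius of gyration grows by the same 1%, whereas an individual atom 20 Å from the centroid moves by only 0.2 Å, which is within the range of ordinary coordinate differences between related structures. The coordinates and the unweighted (mass-free) definition of the radius of gyration are assumptions made for illustration.

```python
import math

def centroid(coords):
    n = len(coords)
    return tuple(sum(p[i] for p in coords) / n for i in range(3))

def radius_of_gyration(coords):
    """Unweighted radius of gyration: RMS distance of the atoms from their centroid."""
    c = centroid(coords)
    mean_sq = sum(sum((p[i] - c[i]) ** 2 for i in range(3)) for p in coords) / len(coords)
    return math.sqrt(mean_sq)

def expand(coords, factor):
    """Scale every coordinate by the same factor, as a cell error in a, b and c would."""
    return [tuple(factor * v for v in p) for p in coords]

# Hypothetical atom positions in Angstroms.
model = [(0.0, 0.0, 0.0), (20.0, 0.0, 0.0), (0.0, 20.0, 0.0),
         (0.0, 0.0, 20.0), (12.0, 7.0, 3.0), (5.0, 15.0, 9.0)]
expanded = expand(model, 1.01)

rg_model = radius_of_gyration(model)
rg_expanded = radius_of_gyration(expanded)
print(f"Rg ratio: {rg_expanded / rg_model:.4f}")
```

Comparing such a ratio between a model and a close homolog is one conceivable way to expose a uniform scale error that a superposition-based root-mean-square difference would tend to understate.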

In the early days of structural biology, the following question was sometimes discussed: “If one were working on the crystal structure of a heretofore unknown protein, would it be possible to generate a ‘fake’ model of the structure that would be sufficiently plausible to be accepted as the correct structure?” The general consensus at the time was that to do so would require more work than to simply determine the correct structure experimentally. Nevertheless, experience has shown that the identification of incorrect structures is not a trivial matter, has typically taken 5–10 years (Table I), and in some instances is yet to be resolved.34, 35 It might also be noted (Table I) that the resolution of the diffraction data for the structure reports that had to be withdrawn is quite variable (4.5–2.3 Å). Furthermore, several of the structural revisions were based on data to lower resolution than the initial report. Higher resolution helps but is not, of itself, a guarantee for reliability.

The experience gained from studying the proteins listed in Table I has served to highlight some of the “danger signals” that may indicate questionable crystallographic models. Early experience with ferredoxin, for example, emphasized the risk of including an excessively large number of water molecules in the crystallographic model.2 The error with the lysozyme mutant mentioned above occurred because the cell constants and structure factor amplitudes were recorded separately.33 It could have been avoided by the trivial change of keeping both of these quantities in the same file. For both teaching and research purposes, it would be very helpful if the PDB could maintain a directory including the coordinates and structure amplitudes for all (or as many as possible) of the proteins listed in Table I, both the initial reports and any subsequent revision. Easy access to these data would facilitate the development of new and better methods for structure validation.
