Commentary
Sorting the chaff from the wheat at the PDB
Article first published online: 2 DEC 2008
DOI: 10.1002/pro.13
Copyright © 2008 The Protein Society
Additional Information
How to Cite
Tronrud, D. E. and Matthews, B. W. (2009), Sorting the chaff from the wheat at the PDB. Protein Science, 18: 2–5. doi: 10.1002/pro.13
Publication History
- Issue published online: 16 DEC 2008
- Manuscript Received: 20 OCT 2008
- Manuscript Accepted: 20 OCT 2008
There is no dispute that the overwhelming majority of the 50,000 structures in the Protein Data Bank (PDB) are essentially correct (i.e., lacking major experimental error). Nevertheless, there have been, and continue to be, reports of protein structures that are seriously in error. A number of these are listed in Table I.
| Protein | Year of initial report | Year of revision or retraction | Resolution of initial report (Å) | Resolution of revision (Å) | Reference, initial report | Reference, revision or retraction |
|---|---|---|---|---|---|---|
| Ferredoxin from Azotobacter vinelandii | 1982 | 1988 | 2.0 | 3.0 | Ghosh et al.1 | Stout et al.2 |
| Gene V protein | 1983 | 1995 | 2.3 | 1.8 | Brayer and McPherson3 | Skinner et al.4 |
| 6-Phosphogluconate dehydrogenase | 1983 | 1991 | 2.6 | 2.5 | Adams et al.5 | Adams et al.6 |
| Nucleosome histone core | 1985 | 1994 | 3.3 | 3.1 | Burlingame et al.7 | Wang et al.8 |
| EcoRI endonuclease | 1986 | 1990 | 3.0 | 2.8 | McClarin et al.9 | Kim et al.10 |
| Phosphocarrier protein HPr | 1987 | 1993 | 2.8 | 2.0 | El-Kabbani et al.11 | Jia et al.12 |
| Interleukin-2 | 1987 | 1992 | 3.0 | 2.5 | Brandhuber et al.13 | Bazan14 |
| Rubisco small subunit | 1988 | 1989 | 2.6 | 2.8 | Chapman et al.15 | Knight et al.16 |
| Yeast enolase | 1988 | 1989 | 2.25 | 2.25 | Lebioda and Stec17 | Lebioda et al.18 |
| Glutaminase-asparaginase | 1988 | 1994 | 3.0 | 2.9 | Ammon et al.19 | Lubkowski et al.20 |
| Photoactive yellow protein | 1989 | 1995 | 2.4 | 1.4 | McRee et al.21 | Borgstahl et al.22 |
| Serum albumin | 1992 | 1998 | 2.8 | 2.5 | He and Carter23 | Curry et al.24 |
| MsbA ABC transporter | 2001 | 2006 | 4.5 | — | Chang and Roth25 | Dawson and Locher26; Chang et al.27 |
| EmrE multidrug transporter | 2005 | 2006 | 3.7 | — | Pornillos et al.28 | Chang et al.27 |
It might be argued that as crystallographers have improved their techniques, and learned by experience, they should become “better” at solving structures. Table I suggests that this is not necessarily the case. Indeed, it is remarkable that all of the earliest protein structures to be determined at atomic resolution (myoglobin, lysozyme, chymotrypsin, carboxypeptidase, ribonuclease) have proven to be essentially correct and have not needed substantive revision. When these structures were first reported, now all over 40 years ago, the protagonists never knew quite what to expect. Also, in most cases the investigators had little if any experience in interpreting protein electron density maps. All of these early structures were determined using fairly high resolution data (∼2.5–2.0 Å) and multiple heavy-atom derivatives (because this is what had worked for myoglobin). Looking back, some of these structures were probably “overdetermined,” in the sense that the basic fold of the protein might have been found using a subset of the original data. The risk of using too few observations became apparent later (Table I).
Not surprisingly, and for good reason, structures found to be in error have prompted the development of checks (i.e., “model validation”) that can be used to try to ensure that similar mistakes do not occur in the future.29
A number of these tests are based on expected stereochemistry. Are bond lengths and angles within normal limits? Are the Ramachandran angles acceptable? Are there nonbonded atoms that are placed too close to each other? At the same time, such tests need to be applied with discretion.30 For example, Hooft et al.31 published a list of “errors” in protein structures in the PDB based on violations of various geometric criteria. They considered all D-amino acids in the Protein Data Bank to be “errors,” but some of these were from protein complexes with cell-wall fragments that contain bona fide D-amino acids. Others were from the antibiotic gramicidin, which also contains D-amino acids.
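As an illustration of the simplest kind of stereochemical test, the following Python sketch flags nonbonded atom pairs that lie too close together. It is a minimal toy, not the procedure of any particular validation program: the function name `find_close_contacts`, the single cutoff of 2.2 Å, and the toy coordinates are all assumptions made for this example, and real validators use limits that depend on atom type.

```python
import math
from itertools import combinations


def find_close_contacts(coords, bonded, min_dist=2.2):
    """Return (i, j, distance) for nonbonded atom pairs closer than min_dist (Å).

    coords   : list of (x, y, z) tuples in Å
    bonded   : set of frozenset({i, j}) index pairs that are covalently bonded
    min_dist : hypothetical cutoff chosen only for this illustration
    """
    contacts = []
    for i, j in combinations(range(len(coords)), 2):
        if frozenset((i, j)) in bonded:
            continue  # bonded atoms are expected to be close together
        d = math.dist(coords[i], coords[j])
        if d < min_dist:
            contacts.append((i, j, d))
    return contacts


# Toy model: atoms 0 and 1 are bonded (1.5 Å apart); atom 2 is not bonded
# to atom 0 but sits only 1.0 Å from it; atom 3 is far from everything.
coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (-1.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
bonded = {frozenset((0, 1))}
clashes = find_close_contacts(coords, bonded)
for i, j, d in clashes:
    print(f"atoms {i} and {j} are {d:.2f} Å apart")
```

A check of this kind is purely geometric, which is why it needs the discretion described above: it reports what is unusual, not what is wrong.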
Other tests of model structures are more crystallographic in nature. Are the crystallographic refinement residuals R and Rfree sufficiently low? Do the B-factors (indicative of mobility) correlate with solvent exposure? Does the placement of the atoms in the model match the local electron density? Yet other tests are based on energetic considerations. Do polar atoms have polar neighbors and, likewise, are nonpolar atoms in nonpolar environments?
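For readers less familiar with the residuals mentioned here, the sketch below computes the conventional R-factor, R = Σ| |Fobs| − |Fcalc| | / Σ|Fobs|, and evaluates it separately on the working reflections and on a held-out "free" set. The function names, the toy amplitudes, and the choice of free reflection are assumptions for illustration, and the observed and calculated amplitudes are assumed to be on a common scale already.

```python
def r_factor(f_obs, f_calc):
    """Conventional residual R = sum| |Fo| - |Fc| | / sum|Fo|."""
    if len(f_obs) != len(f_calc):
        raise ValueError("f_obs and f_calc must have the same length")
    numerator = sum(abs(abs(fo) - abs(fc)) for fo, fc in zip(f_obs, f_calc))
    denominator = sum(abs(fo) for fo in f_obs)
    return numerator / denominator


def r_work_and_r_free(f_obs, f_calc, free_flags):
    """Split reflections by free_flags and return (R_work, R_free)."""
    work = [(fo, fc) for fo, fc, free in zip(f_obs, f_calc, free_flags) if not free]
    free = [(fo, fc) for fo, fc, free in zip(f_obs, f_calc, free_flags) if free]
    r_work = r_factor([fo for fo, _ in work], [fc for _, fc in work])
    r_free = r_factor([fo for fo, _ in free], [fc for _, fc in free])
    return r_work, r_free


# Toy amplitudes (hypothetical values); the third reflection is held out
# of refinement and used only for R_free.
f_obs = [100.0, 50.0, 30.0, 20.0]
f_calc = [90.0, 55.0, 33.0, 20.0]
free_flags = [False, False, True, False]
r_work, r_free = r_work_and_r_free(f_obs, f_calc, free_flags)
print(f"R_work = {r_work:.3f}, R_free = {r_free:.3f}")
```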
Notwithstanding some limitations, tests such as these provide an overall way to estimate the quality of structural models in the PDB, and this is the subject of a recent report by Brown and Ramaswamy.32 To use their words, “the most striking result is the association between structure quality and the journal in which the structure was first published. The worst offenders are the apparently high-impact general science journals that include Cell, Science, Molecular Cell and Nature. The rush to publish high-impact work in the competitive atmosphere may have led to the proliferation of poor-quality structures.”
In the current issue of Protein Science, Sheffler and Baker33 introduce a new method to test for the integrity of structure models based on core packing. Their approach highlights specific entries in the PDB that have been questioned as unreliable. It also points to a type of error that might occur for other structures and would be quite difficult to detect, namely, a mistake in the dimensions of the crystallographic unit cell. In a typical X-ray structure analysis, the dimensions of the unit cell are routinely obtained to high precision during X-ray data collection and are subsequently deposited in the PDB along with the crystallographic coordinates. As noted by Sheffler and Baker, PDB entry 179L corresponds to a mutant of T4 lysozyme, which was inadvertently refined with the a and b cell dimensions 10% too large (entry 177L corresponds to the same mutant refined with the correct unit cell). In the mathematical description of diffraction, the coordinates are in fractional (dimensionless) units, not in Angstroms. If a structure is refined with the wrong unit cell dimensions, the fractional coordinates will remain essentially the same and the calculated structure factor amplitudes will also change little (see the next paragraph). The consequences of the error in the cell dimensions become more apparent when the fractional coordinates are converted into absolute values (i.e., Angstroms). In the case of the lysozyme mutant, a and b were 10% too large. This meant that the refined model was “stretched” by 10% in the directions of the a and b cell edges. This in turn caused unusual cavities in the core, which were recognized by the Sheffler and Baker procedure. In terms of the crystallographic refinement, this “stretching” of the model distorts individual bond lengths and angles, but this was not apparent from the standard checking procedures.
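The stretching effect can be reproduced with a few lines of Python. The sketch below converts fractional coordinates to Angstroms using the common orthogonalization convention (a along x, b in the xy plane) and compares distances computed with a "true" cell and with a cell whose a and b edges are 10% too large. The cell dimensions and coordinates are hypothetical values chosen for the illustration; they are not those of PDB entries 177L or 179L.

```python
import math


def orthogonalization_matrix(a, b, c, alpha, beta, gamma):
    """Matrix taking fractional coordinates to Cartesian Å (angles in degrees).

    Convention assumed here: a lies along x and b lies in the xy plane.
    """
    ca, cb, cg = (math.cos(math.radians(t)) for t in (alpha, beta, gamma))
    sg = math.sin(math.radians(gamma))
    # Unit-cell volume divided by a*b*c
    v = math.sqrt(1.0 - ca * ca - cb * cb - cg * cg + 2.0 * ca * cb * cg)
    return [
        [a, b * cg, c * cb],
        [0.0, b * sg, c * (ca - cb * cg) / sg],
        [0.0, 0.0, c * v / sg],
    ]


def frac_to_cart(m, frac):
    """Apply the 3x3 matrix m to one fractional coordinate triple."""
    return tuple(sum(m[i][k] * frac[k] for k in range(3)) for i in range(3))


def distance(m, p, q):
    """Distance in Å between fractional points p and q for the cell matrix m."""
    return math.dist(frac_to_cart(m, p), frac_to_cart(m, q))


# Hypothetical hexagonal-type cell, and the same cell with a and b 10% too large.
true_cell = orthogonalization_matrix(60.0, 60.0, 100.0, 90.0, 90.0, 120.0)
wrong_cell = orthogonalization_matrix(66.0, 66.0, 100.0, 90.0, 90.0, 120.0)

p = (0.10, 0.20, 0.30)
q_along_a = (0.15, 0.20, 0.30)  # separated from p along the a edge only
q_along_c = (0.10, 0.20, 0.35)  # separated from p along the c edge only

d_a_true = distance(true_cell, p, q_along_a)
d_a_wrong = distance(wrong_cell, p, q_along_a)
d_c_true = distance(true_cell, p, q_along_c)
d_c_wrong = distance(wrong_cell, p, q_along_c)
print(f"along a: {d_a_true:.2f} Å (true cell) vs {d_a_wrong:.2f} Å (wrong cell)")
print(f"along c: {d_c_true:.2f} Å (true cell) vs {d_c_wrong:.2f} Å (wrong cell)")
```

The same fractional coordinates give distances that are 10% longer in the ab plane and unchanged along c, which is the anisotropic stretching described above.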
The insensitivity of the refinement process to substantial errors in the cell constants is at first surprising. If one takes a particular model from the PDB and performs a series of parallel refinements where the cell constants are systematically varied, one finds that the resulting R-values increase rapidly as one moves away from the “true” cell constants. This result, however, does not imply that an error in the cell constants will necessarily be signaled by an increase in the R-factor. If the coordinates of the protein molecule are in Angstroms, as are the entries in the PDB, then a change Δa in one of the cell dimensions will, in effect, cause a rigid-body translation of the protein parallel to that cell edge, with magnitude between zero and ±Δa depending on the location of the molecule in the unit cell. If Δa is sufficiently large (e.g., several Angstroms), the shift in the protein coordinates may be outside the range of convergence of the refinement procedure and the R-value will remain high. If, however, the starting crystallographic coordinates are converted to fractional values before making the change in the cell constant then the R-factor will change very little and the refinement procedure will be very insensitive to the error in the cell edge.
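The two cases in this paragraph can be contrasted in a small numerical experiment. The Python sketch below is a deliberately simplified model: it uses an orthorhombic cell, three point atoms with unit scattering factors, and no B-factors, all of which are assumptions made for the illustration. Under these assumptions the structure factor amplitude depends only on the fractional coordinates, so holding them fixed leaves the amplitude unchanged, whereas holding the Angstrom coordinates fixed changes it. In a real refinement the atomic scattering factors and B-factor terms depend on resolution, and hence on the cell, which is why the R-factor changes "very little" rather than not at all.

```python
import cmath
import math


def amplitude(frac_coords, hkl):
    """|F(hkl)| for unit-weight point atoms at the given fractional coordinates."""
    h, k, l = hkl
    total = sum(cmath.exp(2j * math.pi * (h * x + k * y + l * z))
                for x, y, z in frac_coords)
    return abs(total)


def cart_to_frac(cart_coords, cell):
    """Orthorhombic cell only: divide each Å coordinate by its cell edge."""
    a, b, c = cell
    return [(x / a, y / b, z / c) for x, y, z in cart_coords]


def frac_to_cart(frac_coords, cell):
    """Orthorhombic cell only: multiply each fractional coordinate by its edge."""
    a, b, c = cell
    return [(u * a, v * b, w * c) for u, v, w in frac_coords]


true_cell = (50.0, 60.0, 70.0)   # hypothetical cell edges in Å
wrong_cell = (55.0, 60.0, 70.0)  # a is too large by delta_a = 5 Å
atoms_cart = [(5.0, 6.0, 7.0), (20.0, 15.0, 30.0), (35.0, 40.0, 10.0)]
hkl = (3, 1, 2)

frac_true = cart_to_frac(atoms_cart, true_cell)
amp_true = amplitude(frac_true, hkl)

# Case 1: coordinates held fixed in Å, so the wrong cell changes every
# fractional coordinate and the calculated amplitude changes with them.
amp_cart_fixed = amplitude(cart_to_frac(atoms_cart, wrong_cell), hkl)

# Case 2: fractional coordinates held fixed, so the model is stretched in Å
# but the calculated amplitude is the same as for the true cell.
stretched_cart = frac_to_cart(frac_true, wrong_cell)
amp_frac_fixed = amplitude(cart_to_frac(stretched_cart, wrong_cell), hkl)

# Shift of each atom along x between the two descriptions: u * delta_a,
# which lies between zero and delta_a depending on position in the cell.
shifts = [s[0] - t[0] for s, t in zip(stretched_cart, atoms_cart)]
print(amp_true, amp_cart_fixed, amp_frac_fixed, shifts)
```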
In the case of the lysozyme mutant, the error in the cell dimensions was substantial and was recognized by the cavity calculation. A smaller error would be much harder to identify. As an example, suppose one had a crystal in which all three cell dimensions were too large by 1%. (Such an error might occur if the crystal-to-detector distance or the X-ray wavelength were specified incorrectly.) The subsequent refinement of the structure would seem normal and the error would be almost impossible to detect by the standard tests. The refined model, however, would be expanded by 1%. Such an error might be detected if the structure were compared with a close homolog, but it would presumably require more than a routine calculation of the root-mean-square difference between the two structures. In general, an error in one or other of the cell dimensions of a crystalline protein would be expected to be most apparent in global features of the whole protein such as density, radius of gyration, or overall dimensions.
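A short Python sketch makes the contrast between local and global measures concrete. It applies a uniform 1% expansion to a toy set of coordinates and compares the change in a single bond-like distance with the change in the radius of gyration. The coordinates are arbitrary values invented for the example, and the radius of gyration is computed with equal weights for all atoms, which is a simplifying assumption.

```python
import math


def radius_of_gyration(coords):
    """Root-mean-square distance of the atoms from their centroid (equal weights)."""
    n = len(coords)
    centroid = tuple(sum(p[i] for p in coords) / n for i in range(3))
    return math.sqrt(sum(math.dist(p, centroid) ** 2 for p in coords) / n)


def scale(coords, factor):
    """Uniformly expand or shrink all coordinates about the origin."""
    return [tuple(factor * v for v in p) for p in coords]


# Toy model in Å (hypothetical coordinates); atoms 0 and 1 are 1.53 Å apart.
model = [
    (0.0, 0.0, 0.0), (1.53, 0.0, 0.0), (12.0, 4.0, -3.0),
    (-8.0, 15.0, 6.0), (5.0, -11.0, 9.0), (-14.0, -6.0, -10.0),
]
expanded = scale(model, 1.01)  # all three cell edges 1% too large

bond_true = math.dist(model[0], model[1])
bond_expanded = math.dist(expanded[0], expanded[1])
rg_true = radius_of_gyration(model)
rg_expanded = radius_of_gyration(expanded)

print(f"bond length change:        {bond_expanded - bond_true:.4f} Å")
print(f"radius of gyration change: {rg_expanded - rg_true:.4f} Å")
```

Both quantities grow by the same 1%, but the absolute change in a single bond is only about 0.015 Å, which is easily lost among ordinary refinement deviations, whereas the change accumulates over the dimensions of the whole molecule.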
In the early days of structural biology, the following question was sometimes discussed: “If one were working on the crystal structure of a heretofore unknown protein, would it be possible to generate a ‘fake’ model of the structure that would be sufficiently plausible to be accepted as the correct structure?” The general consensus at the time was that to do so would require more work than to simply determine the correct structure experimentally. Nevertheless, experience has shown that the identification of incorrect structures is not a trivial matter, has typically taken 5–10 years (Table I), and in some instances is yet to be resolved.34, 35 It might also be noted (Table I) that the resolution of the diffraction data for the structure reports that had to be withdrawn is quite variable (4.5–2.0 Å). Furthermore, several of the structural revisions were based on data to lower resolution than the initial report. Higher resolution helps but is not, of itself, a guarantee of reliability.
The experience gained from studying the proteins listed in Table I has served to highlight some of the “danger signals” that may indicate questionable crystallographic models. Early experience with ferredoxin, for example, emphasized the risk of including an excessively large number of water molecules in the crystallographic model.2 The error with the lysozyme mutant mentioned above occurred because the cell constants and structure factor amplitudes were recorded separately.33 It could have been avoided by the trivial change of keeping both of these quantities in the same file. For both teaching and research purposes, it would be very helpful if the PDB could maintain a directory including the coordinates and structure amplitudes for all (or as many as possible) of the proteins listed in Table I, both the initial reports and any subsequent revision. Easy access to these data would facilitate the development of new and better methods for structure validation.
References
- 1. Ghosh et al. (1982) Iron-sulfur clusters and protein structure of Azotobacter ferredoxin at 2.0 Å resolution. J Mol Biol 158: 73–109.
- 2. Stout et al. (1988) Structure of ferredoxin I from Azotobacter vinelandii. Proc Natl Acad Sci USA 85: 1020–1022.
- 3. Brayer and McPherson (1983) Refined structure of the gene 5 DNA binding protein from bacteriophage fd. J Mol Biol 169: 565–596.
- 4. Skinner et al. (1994) Structure of the gene V protein of bacteriophage f1 determined by multiwavelength X-ray diffraction on the selenomethionyl protein. Proc Natl Acad Sci USA 91: 2071–2075.
- 5. Adams et al. (1983) The three dimensional structure of sheep liver 6-phosphogluconate dehydrogenase at 2.6 Å resolution. EMBO J 2: 1009–1014.
- 6. Adams et al. (1991) The structure of 6-phosphogluconate dehydrogenase refined at 2.5 Å resolution. Acta Crystallogr B 47: 817–820.
- 7. Burlingame et al. (1985) Crystallographic structure of the octameric histone core of the nucleosome at a resolution of 3.3 Å. Science 228: 546–553.
- 8. Wang et al. (1994) The octameric histone core of the nucleosome. Structural issues resolved. J Mol Biol 236: 178–188.
- 9. McClarin et al. (1986) Structure of the DNA-Eco RI endonuclease recognition complex at 3 Å resolution. Science 234: 1526–1541.
- 10. Kim et al. (1990) Refinement of EcoRI endonuclease crystal structure: a revised protein chain tracing. Science 249: 1307–1309.
- 11. El-Kabbani et al. (1987) Tertiary structure of histidine-containing protein of the phosphoenolpyruvate:sugar phosphotransferase system of Escherichia coli. J Biol Chem 262: 12926–12929.
- 12. Jia et al. (1993) The 2.0-Å resolution structure of Escherichia coli histidine-containing phosphocarrier protein HPr. J Biol Chem 268: 22490–22501.
- 13. Brandhuber et al. (1987) Three-dimensional structure of interleukin-2. Science 238: 1707–1709.
- 14. Bazan (1992) Unraveling the structure of IL-2. Science 257: 410–413.
- 15. Chapman et al. (1988) Tertiary structure of plant RuBisCO: domains and their contacts. Science 241: 71–74.
- 16. Knight et al. (1989) Reexamination of the three-dimensional structure of the small subunit of RuBisCo from higher plants. Science 244: 702–705.
- 17. Lebioda and Stec (1988) Crystal structure of enolase indicates that enolase and pyruvate kinase evolved from a common ancestor. Nature 333: 683–686.
- 18. Lebioda et al. (1989) The structure of yeast enolase at 2.25 Å resolution. J Biol Chem 264: 3685–3693.
- 19. Ammon et al. (1988) Preliminary crystal structure of Acinetobacter glutaminasificans glutaminase-asparaginase. J Biol Chem 263: 150–156.
- 20. Lubkowski et al. (1994) Refined crystal structure of Acinetobacter glutaminasificans glutaminase-asparaginase. Acta Crystallogr D Biol Crystallogr 50: 826–832.
- 21. McRee et al. (1989) Crystallographic structure of a photoreceptor protein at 2.4 Å resolution. Proc Natl Acad Sci USA 86: 6533–6537.
- 22. Borgstahl et al. (1995) 1.4 Å structure of photoactive yellow protein, a cytosolic photoreceptor: unusual fold, active site, and chromophore. Biochemistry 34: 6278–6287.
- 23. He and Carter (1992) Atomic structure and chemistry of human serum albumin. Nature 358: 209–215.
- 24. Curry et al. (1998) Crystal structure of human serum albumin complexed with fatty acid reveals an asymmetric distribution of binding sites. Nat Struct Biol 5: 827–835.
- 25. Chang and Roth (2001) Structure of MsbA from E. coli: a homolog of the multidrug resistance ATP binding cassette (ABC) transporters. Science 293: 1793–1800.
- 26. Dawson and Locher (2006) Structure of a bacterial multidrug ABC transporter. Nature 443: 180–185.
- 27. Chang et al. (2006) Retraction: structure of MsbA from E. coli: a homolog of the multidrug resistance ATP binding cassette (ABC) transporters. Science 314: 1875.
- 28. Pornillos et al. (2005) X-ray structure of the EmrE multidrug transporter in complex with a substrate. Science 310: 1950–1953.
- 29. (2000) Validation of protein crystal structures. Acta Crystallogr D Biol Crystallogr 56: 249–265.
- 30. (1996) Storing diffraction data. Nature 383: 18–19.
- 31. Hooft et al. (1996) Errors in protein structures. Nature 381: 272.
- 32. Brown and Ramaswamy (2007) Quality of protein crystal structures. Acta Crystallogr D Biol Crystallogr 63: 941–950.
- 33. Sheffler and Baker (2009) Assessment of protein core packing for protein structure prediction and validation. Protein Sci.
- 34. (2007) Crystallographic evidence for deviating C3b structure? Ajees et al. reply. Nature 448: E2–E3.
- 35. (2007) Crystallographic evidence for deviating C3b structure? Nature 448: E1–E2.

