Creating a community resource for protein science

Authors

  • Helen M. Berman

    Corresponding author
    1. Department of Chemistry and Chemical Biology, Center for Integrative Proteomics Research, Rutgers, The State University of New Jersey, Piscataway, New Jersey 08854

Abstract

Helen Berman is the recipient of the Protein Society 2012 Carl Brändén Award.

In addition to being one of the early pioneers in protein crystallography, Carl Brändén made significant contributions to science education with his elegant and beautifully illustrated book Introduction to Protein Structure (Brändén and Tooze, New York: Garland, 1991). It is truly an honor to receive this award in his name. This award and the 40th anniversary of the Protein Data Bank (PDB; Berman et al., Structure 2012;20:391–396) have given me an opportunity to reflect on the various components that have contributed to building a resource for protein science and to try to quantify the impact of having PDB data openly available.

Early History of Structural Biology and the Protein Data Bank

With the discovery of Bragg's law in 1912,1 X-ray crystallography began to be used as a method to determine atomic structure. It did not take long for visionary scientists to explore its use for determining the structure of proteins. In 1934, Bernal and Crowfoot Hodgkin produced the first diffraction pattern of a pepsin crystal.2 The determination of the structures of myoglobin3, 4 and then hemoglobin5 earned Perutz and Kendrew Nobel Prizes in 1962. This marked the beginning of an era that has seen extraordinary progress in the use of X-ray crystallography for structure determination of a wide range of biological molecules, for which several more Nobel Prizes were awarded.6 In the 1980s, nuclear magnetic resonance (NMR) methods began to be exploited for structure determination, and, more recently, 3D electron microscopy has allowed us to visualize the structures of very large molecular machines.

The structures of biological molecules contain a treasure trove of information. There is no doubt that every investigator who determines a structure wants to fully analyze the results of their experiment and probably has the greatest insights into how exactly to do that. At a minimum, these investigators need a place to store their data in a secure space that is preferably not in the local laboratory. But it is also true that others might want to compare, classify, and analyze groups of structures, which would require a way to easily distribute the data. The pioneers of structural biology recognized the necessity for a central repository that could store and distribute structural data, and a group of these scientists stepped forward to take on the task of creating an archive.7 The Protein Data Bank (PDB) was established in 1971 at Brookhaven National Laboratory (BNL) with an initial holding of seven structures.8

Components of the PDB

Management

The initial Protein Data Bank (PDB) was managed as a collaboration between BNL and the Cambridge Crystallographic Data Centre.9 Later, a group in Osaka, Japan, joined the collaboration. All data were annotated at BNL. In 1998, when the Research Collaboratory for Structural Bioinformatics (RCSB) PDB was awarded the contract from the NSF,10 a collaboration was established with the PDBj11 group at Osaka University to collect and process data. At the European Bioinformatics Institute in Hinxton, UK, the Macromolecular Structure Database group12 (now PDBe) also began to collect data. In 2003, the Worldwide PDB (wwPDB) became a formalized collaboration among these three groups, which continue to collect and annotate coordinate and related experimental data for the PDB archive.13 Later, the BioMagResBank joined as a collection center for NMR spectral data and quantitative derived data.14 The purpose of the wwPDB was to ensure that, with multiple collection centers, there would be a single global PDB with uniform standards for data processing and validation. A File Transfer Protocol site contains the master archive of data files and is mirrored by the RCSB PDB, PDBe, and PDBj.

Data content

The primary results of a crystal structure determination are the coordinates of every atom in the molecule. For a small protein, there are perhaps 1000 atom sites; for a large one, there are more than 10,000. In a PDB entry, each atom site is identified with atom and residue labels. In addition, there is information about the chemistry of the polymer and small molecule ligands as well as how the structure was determined. For structures determined using X-ray methods, temperature factors and occupancies are included in the atomic site records. Structure factors are also archived for X-ray entries, as are restraints and chemical shifts for NMR entries. Electron microscopy (3DEM) entries contain the volume data and the atomic model, where possible.15
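
As an illustration of the atom site records described above, the following is a minimal Python sketch of writing and reading one ATOM line in the fixed-column legacy PDB layout. The column positions follow the published PDB file format; the class name, function names, and the example values are hypothetical and chosen only for this sketch. Fields such as alternate location, insertion code, element, and charge are deliberately ignored.

```python
from dataclasses import dataclass


@dataclass
class AtomSite:
    """One atom site: labels, coordinates, occupancy, temperature factor."""
    serial: int
    atom_name: str
    residue_name: str
    chain_id: str
    residue_number: int
    x: float
    y: float
    z: float
    occupancy: float
    temperature_factor: float


def format_atom_record(site: AtomSite) -> str:
    """Write one ATOM line using the fixed columns of the legacy format."""
    # Short atom names conventionally start in column 14, not column 13.
    name = site.atom_name if len(site.atom_name) == 4 else f" {site.atom_name:<3}"
    return (
        f"{'ATOM':<6}{site.serial:>5} {name} "
        f"{site.residue_name:>3} {site.chain_id}{site.residue_number:>4}{'':4}"
        f"{site.x:8.3f}{site.y:8.3f}{site.z:8.3f}"
        f"{site.occupancy:6.2f}{site.temperature_factor:6.2f}"
    )


def parse_atom_record(line: str) -> AtomSite:
    """Read one ATOM/HETATM line by slicing its fixed columns."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    line = line.ljust(80)  # the format is defined on 80-character lines
    return AtomSite(
        serial=int(line[6:11]),                   # columns 7-11
        atom_name=line[12:16].strip(),            # columns 13-16
        residue_name=line[17:20].strip(),         # columns 18-20
        chain_id=line[21],                        # column 22
        residue_number=int(line[22:26]),          # columns 23-26
        x=float(line[30:38]),                     # columns 31-38
        y=float(line[38:46]),                     # columns 39-46
        z=float(line[46:54]),                     # columns 47-54
        occupancy=float(line[54:60]),             # columns 55-60
        temperature_factor=float(line[60:66]),    # columns 61-66
    )


# Hypothetical atom site, invented for illustration only.
example = AtomSite(1, "N", "MET", "A", 1, 27.340, 24.430, 2.614, 1.00, 9.67)
line = format_atom_record(example)
print(line)
print(parse_atom_record(line))
```

Because every field lives at a fixed position, a reader needs no delimiter handling, which is part of why the format has been so easy for software to adopt.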

The data deposited into the PDB evolved as structural biology matured. In the case of crystallography, rapid advances in data collection, structure determination, and refinement have necessitated the addition of new data items. Synchrotron sources were not used in crystallographic experiments when the PDB began, and proteins were not routinely refined. With the advent of synchrotron beamlines, new methods for data collection and structure determination have evolved. Attitudes about what should be collected for the archive have also changed. Structure factors were rarely deposited before 1990. Now structure factor files accompany every X-ray structure. Chemical shift and restraint data are required for NMR structures. As electron microscopy emerges as a powerful method to determine the structures of macromolecular machines, the wwPDB has created data items to describe the models determined by this method and now collects map volumes.

Although the PDB is an archive and could in principle collect the data “as is,” PDB entries are very carefully curated in order for the data to be truly useful. Without curation, the PDB would merely be a place for the safe keeping of data for the authors. With curation, users in a variety of fields can more easily view and analyze the data. This enables structural biologists to have more confidence in selecting structures either as a model for molecular replacement or for comparison with their own. Likewise, computational biology has been enabled with the availability of a corpus of data in a uniform format.

How is curation done? Biocurators with backgrounds in crystallography, NMR, and 3DEM review every entry and validate the files with the help of a variety of computer tools. The chemistry and geometry of ligands and polymers are checked against known standards, and structure models are checked against the primary data. In addition, secondary structure, biological assembly, and structure determination and refinement parameters are reviewed. Visualizing the structure to make sure that it looks sensible is also a key part of the curation process. Any unusual findings are reported back to the author, who has an opportunity to revise the file and perhaps rerefine the model.

Infrastructure

As the science of structure determination develops and grows, the PDB needs to be able to adapt and expand. The addition of new data items requires a flexible infrastructure as well as the collaboration of those with expert knowledge of the science.

The PDB began as a collection of flat files in an ASCII format, entered on 80-column punched cards. In time, data on magnetic tape were sent by depositors and then distributed by mail. In 1976, there were 23 structures in the archive, and 375 data sets were distributed to laboratories via magnetic tape. A listing of available entries was distributed with the newsletter; there was no database that could be queried.

The PDB file format, which is based on having 80 characters per line, was established in 1974.9 This format has been astonishingly durable and human-readable and is still used by thousands of software packages. However, it has well-known limitations, most especially with respect to the number of atoms a file can accommodate. About 20 years ago, a decision was made to try to define a format that could accommodate many more atoms, better definitions, and the ability to more easily port the data into a database. The Macromolecular Crystallographic Information File (mmCIF) format and dictionary met these criteria and initially supported more than 3000 data items.16 As the archive expands to support new methods and richer data descriptions, new data items are added to the dictionary. The name of the format became PDBx and is used internally for data processing and data exchange.17 Because the legacy format was used so widely, every PDBx file is also converted and made available in the PDB file format. However, this has meant that for large files such as the ribosome, the data are split into multiple files. This does not serve the community well.
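
The atom-count limitation can be made concrete with a little arithmetic. The sketch below assumes only that the atom serial number occupies a five-column field in the legacy format, so a single file cannot number more than 99,999 atoms. The function name and the atom counts are hypothetical, and the estimate is a simplification: it ignores other constraints (such as single-character chain identifiers and records that also consume serial numbers) and does not reproduce the actual rules used to split entries.

```python
import math

SERIAL_FIELD_WIDTH = 5                           # columns 7-11 of an ATOM record
MAX_ATOM_SERIAL = 10 ** SERIAL_FIELD_WIDTH - 1   # largest number that fits: 99999


def legacy_files_needed(atom_count: int) -> int:
    """Lower-bound estimate of legacy-format files needed for one structure."""
    if atom_count < 0:
        raise ValueError("atom_count must be non-negative")
    return max(1, math.ceil(atom_count / MAX_ATOM_SERIAL))


# Hypothetical atom counts, for illustration only.
for count in (1_000, 10_000, 99_999, 100_000, 250_000):
    print(count, legacy_files_needed(count))
```

A format without fixed-width numeric fields avoids this ceiling, which is one reason a single PDBx file can hold a structure that would otherwise be split.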

About 1 year ago, the wwPDB met with software developers of refinement programs, and it was agreed that the PDBx format should now be the official format for PDB files. These developers are in the process of adjusting their programs to accept and produce PDBx data files. Ample notification to PDB data depositors and users will be required so as to ensure no disruption of service.

The ways in which data have been deposited and annotated have changed over the years. When the RCSB PDB took over the management of the PDB in 1998, it created a new deposition and annotation system called ADIT10 that could accept either mmCIF or PDB files, but used mmCIF internally. This system was adopted by PDBj. PDBe reengineered BNL's AutoDep system18 to collect and process data in PDB format.19 Considerable efforts went into ensuring that although the systems for collecting the data are different, the end product would be the same no matter where the data were processed. In 2007, it was decided that the overhead involved in maintaining these different systems was much too high. The three data collection centers began to collaborate in the creation of a single deposition and annotation system that uses PDBx as the master format and offers significant improvements in processing efficiency and assurance of data quality. As part of this project, the requirements for every aspect of the process were analyzed using the combined experience in data processing of all the centers. The system will begin testing in Fall 2012.

Another key wwPDB collaboration is the regular review and remediation of the data across the archive. These reviews become particularly important as new methods and discoveries begin to challenge how all structures are curated and represented in the archive. In past reviews, the formats of atom names have been made uniform, details about polymer and ligand chemistry have been improved,20 and the representation of antibiotics and peptide inhibitors has been made uniform. Viruses are now represented in more useful ways.21 The next project we are tackling is the representation of carbohydrates in the archive.

The wwPDB centers are ever mindful that the community expects 24/7 operations and the rapid delivery of quality data. We also know that although users may complain about some aspects of the data, they are also hesitant about changes that could disrupt PDB services or the software they use in their labs. So, the challenge is to continue to improve the underlying data without causing problems for the users. Thus, a considerable amount of planning must go into any new rollout of data.

Although the wwPDB data centers collaborate on annotating data, so that the PDB files are the same no matter where they are retrieved, each site provides different services for analyzing, visualizing, and comparing the data.11, 22–24 These multiple sites, tools, and ways of exploring the data are incredibly useful for “exercising” the data.

Community stakeholders

By definition, a community database must engage with its users. From Day 1, the PDB needed the trust and support of the structure authors. In the early days, this meant convincing researchers to deposit their data and working with them to be sure the data were represented accurately. A few very active protein crystallographers, among them Fred Richards, Max Perutz, and Michael Rossmann, were vocal in their support of this enterprise. Although many people did voluntarily deposit data, others chose to keep the data private. In 1982, the very visionary NIH program officer Marvin Cassman became concerned that data key to the development of drugs for the emerging public health threat of AIDS might not find their way into the PDB. Richards and Dick Dickerson were very vocal in their concerns that valuable data would be lost if data deposition was not mandatory.25 They circulated a petition urging mandatory data deposition, signed by 182 prominent scientists from around the globe. The International Union of Crystallography established a committee charged with creating deposition guidelines. The committee spent a great deal of time defining the scope of what kinds of data should be deposited and when. The guidelines were published in 1989.26 These guidelines then became the basis for journal and funding policies. Today, virtually every journal requires data deposition for publication. It is very important to note here that the data providers who know their data best were the ones who came up with sensible and enforceable guidelines.

The same process is being used for creating validation standards. The wwPDB has set up a group of Task Forces that have been charged with coming up with best practices for validating data. So far, Validation Task Forces (VTFs) have been set up in X-ray crystallography, NMR spectroscopy, electron microscopy,27 and now small-angle scattering. The X-ray VTF published a detailed set of recommendations based on careful analysis of the data.28 These recommendations are now being implemented by the wwPDB and will be part of the new Deposition and Annotation tool. The other task forces will follow suit. The extraordinary care with which these validation standards are being set by experts in the field will ensure that the quality of the data in the archive will continue to be high.

The journals are also key players. Just as every published structure must be deposited in the PDB, journals are beginning to require the submission of validation reports as part of the editorial process as a result of the work of the VTFs.29, 30

Most people who use the PDB are not experimental structural biologists. Computational biologists use the data for classification, analysis, and structure prediction. Experimental biologists use the data to better understand protein function. Educators from grade school to postgraduate level use the PDB in their classrooms. For these users, the wwPDB websites provide rich features and tools such as interactive graphics and educational content. The very widespread use of the data makes it particularly important that it is well curated and that there are presentations of the data that are accessible to a diverse audience.

Trends in the data

The PDB has grown from an archive of seven structures with relatively low-molecular weight to now more than 83,000 structures (Fig. 1). At the current rate of deposition, it is projected that 10,000 structures will be deposited in 2012—more than the size of the entire archive in 2000. Will the growth rate continue? Yes, but perhaps not at the rate seen in the 1990s. As shown in Figure 2, the average molecular weight deposited per year has more than tripled since the PDB began. This is of course due to the vast improvements in technology in all structure-determination methods. For example, synchrotron sources are now used routinely for X-ray data collection (Fig. 3). Scientists have been enabled to take on increasingly challenging problems to answer questions in biology, and they do. As part of the NIH's PSI:Biology program (Pieper et al., in preparation),32 nine groups are tackling notoriously difficult membrane proteins. As we develop the methods to determine these structures, the rate of structure production may decrease before we see the expected increase as we did when high-throughput methods were perfected. A slowdown occurred in 2008, when 7073 entries were deposited; 8130 entries were deposited in the previous year. This change correlates with the discontinuation of the Structural Genomics group in Japan in 2007.33 Similar abrupt changes in funding of large projects and facilities anywhere in the world could have the same effect. Another trend is the increase in the number of ligand bound structures deposited each year [Fig. 1(b)]. These ligands range from very simple organic molecules to complex peptide-like inhibitors and antibiotics [Fig. 1(c)]. The latter increase is attributed to the attempts to design inhibitors for proteases, thrombin, and, more recently, the ribosome. The data contributed by the public sector are no doubt only a fraction of the data being produced by the pharmaceutical industry, most of which are not archived in the PDB.

Figure 1.

Growth of the PDB archive: (a) deposited (in black) and released (gray) structures shown using a logarithmic scale; (b) total number of unique nonpolymer ligands released each year (a single entry may have several ligands); (c) number of peptide-like inhibitor/antibiotic entries released per year. There are three notable peaks in (c): 73 structures with an inhibitor/antibiotic were released in 1994, the majority of which are thrombin inhibitors and renin inhibitors; 130 structures in 2006, the majority of which are thrombin inhibitors and protease inhibitors; and 140 structures in 2011, the majority of which are protease inhibitors and caspase inhibitors.

Figure 2.

Growth in the size (assessed by molecular weight) of released structures in the PDB for entries determined by X-ray crystallography (gray) and NMR (black). For X-ray, the molecular mass was calculated without solvent, corrected for noncrystallographic symmetry (NCS), and with split entries combined into single structures. For viruses or entries that used noncrystallographic symmetry, molecular weights for the entire asymmetric unit were calculated by multiplying the molecular weight of the polymeric chains by the number of nonidentity NCS operators. For NMR, mass was determined excluding water. The large increase shown in 1984 is due to the release of the tomato bushy stunt virus 2tbv.31

Figure 3.

Growth in the use of synchrotron X-ray sources in structure determination as determined from X-ray source annotation. The number of structures determined using synchrotron radiation is shown in gray and the number using home-laboratory sources in black. This plot shows that while the use of home sources in X-ray structure determination has remained roughly constant, the use of synchrotron sources has increased rapidly.

Although the average resolution of X-ray structures in the PDB has stayed at 2.1 Å since 1987, a substantial number are better than that average, thus yielding very high-quality structures (Fig. 4). The number of low-resolution structures has also increased and can be correlated with the increased activity focused on molecular machines of very high-molecular weight.

Figure 4.

Range of the reported high-resolution limit in released X-ray structures in the PDB archive. The resolution limit range of all entries is indicated by the green area. The blue-dotted line indicates the limit of the range if the outlying lowest-resolution structure for that year is removed. The mean resolution limit for a given year is indicated by the dashed red line. The median value after 1989 (not shown) is roughly the same as the mean. Methods developed in the last decade can handle the refinement of lower resolution structures, the value of which has been accepted by the community.

Routine validation of structures before deposition into the PDB has been shown to have a dramatic effect on the quality of structures treated in that way. During PSI efforts 1 and 2, all structures were thoroughly checked by the structural genomics centers before deposition, and very high-quality structures were produced.34 The initial fear that high-throughput methods would result in inferior structures was unfounded. Implementing the recommendations of the wwPDB VTF will help ensure that all structural biologists will perform validation routinely.

In the search for new folds, it is important to select structures containing sequences with much less than 30% sequence similarity. And even with that criterion, there is no guarantee that the protein will have a new fold, as was demonstrated in the early phases of the PSI structural genomics efforts. To understand protein function, however, working with redundant sequences is crucial (Table I). Work on phage T4 lysozyme, in which structures containing sequences with systematic substitutions were determined, gave us dramatic insights into protein folding.36 The determination of the structures of HIV protease and reverse transcriptase bound to different ligands was critical to the development of effective drugs against AIDS (Fig. 5).56, 57

Figure 5.

HIV and drugs bound to HIV protease and reverse transcriptase. (a) HIV structures in the PDB.37–53 Image by David Goodsell is available from the RCSB PDB website. (b) HIV-1 reverse transcriptase,54 HIV-1 protease,55 and related FDA-approved inhibitors. Inhibitors shown in green are available in the PDB bound to a protein and labeled with the first year an example was released in the PDB. Inhibitors shown in gray have not been deposited in the PDB. As of July 2012, there are more than 160 structures of HIV-1 reverse transcriptase in the PDB; ∼ 75 are bound to a drug or inhibitor. There are almost 450 structures of HIV-1 protease in the PDB; ∼ 260 are bound to a drug or inhibitor.

Table I. Clusters of proteins with redundant sequences

Cluster number | Number of structures | Protein cluster | Biological connection
1 | 522, 374, 203 | Lysozyme: bacteriophage T4, hen egg white, human | Evolution, folding, catalysis
2 | 446 | HIV-1 protease | Human immunodeficiency virus (HIV)
3 | 394 | Human carbonic anhydrase II | Osteopetrosis
4 | 379, 300, 266 | Trypsin, thrombin, thrombin light chain | Catalysis, blood clotting
5 | 361 | β-2 microglobulin | MHC complex, virus protection, cancer
6 | 229, 199, 197 | Whale myoglobin, human hemoglobin β-subunit, human hemoglobin α-subunit | Evolution, sickle-cell anemia, thalassemia
7 | 217 | CDK2 | DNA damage repair, cancer, cell proliferation
8 | 190 | Ribonuclease A | Specificity, catalysis
9 | 188 | MAP kinase 14 | Arthritis, stresses, and proinflammatory cytokines
10 | 188 | β-Secretase 1 | Alzheimer's disease
11 | 171, 169 | Insulin A, B chain | Diabetes
12 | 166, 159 | Reverse-transcriptase RNAse H, p51 | Human immunodeficiency virus (HIV)
13 | 162, 158 | Class I histocompatibility antigen A-2 (human, mouse) | Antibody recognition
14 | 157 | Glycogen phosphorylase | Type 2 diabetes, catalysis
15 | 146+ | Ribosomal proteins | Protein biosynthesis
16 | 154 | Cytochrome C peroxidase | Oxidative chemistry
17 | 152 | Transthyretin | Amyloid diseases
18 | 149 | Green fluorescent protein | Bioluminescence

Using the sequence data of all protein chains (as of May 2012), a clustering of sequences was created using BLASTclust35 with a 95% sequence identity cutoff. Resulting clusters were then ranked based on the number of structures and tabulated. Differing clusters of proteins with related functions are reported on the same line separated by commas. The PDB IDs of the entries in each cluster are available in the supplementary materials.
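
The procedure behind Table I (cluster sequences at a 95% identity cutoff, then rank clusters by size) can be sketched in a few lines of Python. This is not BLASTclust and does not reproduce its alignment or scoring: as a stand-in, the sketch assumes sequences that are already aligned and of equal length, uses a naive position-by-position identity, and joins sequences by simple single-linkage clustering. All names and the toy sequences are hypothetical.

```python
from itertools import combinations


def percent_identity(a: str, b: str) -> float:
    """Naive identity for two pre-aligned sequences of equal length."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def cluster_sequences(seqs: dict[str, str], cutoff: float = 95.0) -> list[list[str]]:
    """Single-linkage clustering; clusters are returned largest first."""
    parent = {name: name for name in seqs}

    def find(name: str) -> str:
        # Follow parent links to the cluster representative (with path halving).
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    # Link every pair of sequences whose identity meets the cutoff.
    for a, b in combinations(seqs, 2):
        if percent_identity(seqs[a], seqs[b]) >= cutoff:
            parent[find(a)] = find(b)

    clusters: dict[str, list[str]] = {}
    for name in seqs:
        clusters.setdefault(find(name), []).append(name)
    # Rank by number of members, as in the table; break ties alphabetically.
    return sorted((sorted(m) for m in clusters.values()), key=lambda m: (-len(m), m))


# Made-up toy sequences (20 residues each), for illustration only.
TOY = {
    "seq_a": "ACDEFGHIKLMNPQRSTVWY",
    "seq_b": "ACDEFGHIKLMNPQRSTVWA",  # one substitution relative to seq_a
    "seq_c": "YWVTSRQPNMLKIHGFEDCA",  # unrelated to seq_a
}
print(cluster_sequences(TOY))
```

With real data, the identity function would be replaced by the result of a proper sequence alignment; the ranking step is unchanged.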

The PDB can now be used to describe whole systems in biology. For example, it has been possible to describe the structures of the proteins expressed in a whole organism (Thermotoga maritima),58 and there has been significant progress in our understanding of proteins involved in cancer.59 Significantly, it is now possible to have a molecular description of entire pathways. Although it took 25 years, we now have structures of all the enzymes involved in the tricarboxylic acid cycle (Fig. 6). Similar results can be seen with respect to other pathways.70

Figure 6.

The tricarboxylic acid cycle, also known as the Krebs cycle, shown from a structural biology perspective. Shown are the images of full (or nearly full) structures available in the PDB for all of the enzymes in the cycle.60–69

The seemingly impossible dream of having a structural view of biology is much closer than we thought.

Acknowledgements

Many people at the wwPDB data centers support the continued viability and quality of the PDB archive, in particular the leaders of the member sites: Kim Henrick, Gerard Kleywegt, Haruki Nakamura, and John Markley. The work of the RCSB PDB team at Rutgers University and the University of California San Diego over the years is greatly appreciated, especially that of John Westbrook, Phil Bourne, Zukang Feng, and Christine Zardecki. Special thanks to Brian Hudson and Ezra Peisach for their thoughtful help with this manuscript.
