Rationally selected basis proteins: A new approach to selecting proteins for spectroscopic secondary structure analysis


  • Keith A. Oberg,

    1. Structural Biology and Bioinformatics Center, Structure and Function of Biological Membranes Laboratory, Free University of Brussels (ULB), B-1050 Brussels, Belgium
  • Jean-Marie Ruysschaert,

    1. Structural Biology and Bioinformatics Center, Structure and Function of Biological Membranes Laboratory, Free University of Brussels (ULB), B-1050 Brussels, Belgium
  • Erik Goormaghtigh

    Corresponding author
    1. Structural Biology and Bioinformatics Center, Structure and Function of Biological Membranes Laboratory, Free University of Brussels (ULB), B-1050 Brussels, Belgium
    • Structural Biology and Bioinformatics Center, Structure and Function of Biological Membranes Laboratory, CP 206/2, Free University of Brussels (ULB), Bld du triomphe, Acces 2, B-1050 Brussels, Belgium; fax: 32-2-650-5382.


Protein basis sets have been extensively used as reference data for the determination of protein structure with optical methods such as circular dichroism and infrared spectroscopies. We have taken a new approach to basis protein selection by utilizing three crystal structure classification databases: CATH, SCOP, and PDB_SELECT. Through the use of the information available in these and other online resources, we identified 115 commercially available proteins as potential basis set candidates. By carefully screening the quality of the crystal structures and commercial protein preparations, we obtained a final set of 50 rationally selected proteins (RaSP50) that has been optimized for use in spectroscopic protein structure determination studies. These proteins span the full range of known protein folds as well as α-helix and β-sheet contents, and they represent a more comprehensive variety of fold types than any previous reference set. This report includes a detailed presentation of the reasoning behind the rational protein selection process, a description of the properties of the RaSP50 set, and a discussion of the types of structural and spectral variations that are represented in the set.

Subscripts are used to designate specific secondary structure types using the notation of the Dictionary of Secondary Structures of Proteins (DSSP), and are as follows: H, α-helix; E, β-sheet; T, turn (all types); G, 3₁₀ helix; I, π-helix; S, a sharp bend in the protein backbone which cannot be assigned as T; B, a residue with extended φ, ψ angles that cannot be assigned as E. The notation Σother will be used to designate all residues with non-H, E, or T assignments, whereas C will be used to specifically designate residues that are not given any secondary structure assignment by DSSP.
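As a minimal sketch of how this notation maps onto fractional compositions (FCs), the Python snippet below counts DSSP one-letter codes in a per-residue assignment string. The function name, the example string, and the convention that unassigned residues (C) appear as a blank or '-' are assumptions made for illustration only; they are not taken from the study.

```python
from collections import Counter

# One-letter codes as listed in the text. C marks residues with no DSSP
# assignment; representing them as ' ' or '-' in the input is an assumption.
DSSP_CODES = ("H", "E", "T", "G", "I", "S", "B", "C")


def fractional_composition(assignment):
    """Return the percentage of residues in each class, plus 'other'.

    'other' follows the text's definition: every residue that is not
    assigned H, E, or T (so it includes G, I, S, B, and C).
    """
    normalized = ["C" if ch in (" ", "-") else ch for ch in assignment]
    n = len(normalized)
    if n == 0:
        raise ValueError("empty assignment string")
    counts = Counter(normalized)
    fc = {code: 100.0 * counts.get(code, 0) / n for code in DSSP_CODES}
    fc["other"] = 100.0 - fc["H"] - fc["E"] - fc["T"]
    return fc


# Made-up 20-residue assignment string, for illustration only.
example = "--HHHHHHTT-EEEEESS-G"
fc = fractional_composition(example)
print({k: round(v, 1) for k, v in fc.items()})
```

In this notation, FCH and FCE correspond to fc["H"] and fc["E"], and Σother corresponds to fc["other"].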

Infrared (IR) and circular dichroism (CD) spectroscopies are the two most commonly used techniques for determining the structure of proteins that have not been—or cannot be—characterized with NMR or X-ray crystallography. Structure analysis using these optical methods involves the reconstruction of protein spectra with synthetic or derived bands that have known relationships to different secondary structures. The earliest attempts at developing spectra analysis methods relied on reference spectra from model polypeptides (Greenfield and Fasman 1969; Brahms and Brahms 1980), but as protein crystal structures became available, there was a shift to a reliance on those proteins with solved structures. By using proteins rather than synthetic peptides, it was thought that the information contained in reference spectra would better represent naturally occurring polypeptides. In addition to this obvious benefit, this practice also permits analysis methods to be tested on actual protein spectra. In this manner the effectiveness of different methods can be evaluated by the comparison of determined protein structures with crystal structures.

The earliest efforts, published in the 1970s, utilized small sets of three (Saxena and Wetlaufer 1971) or 5–8 (Chen and Yang 1971; Chen et al. 1972, 1974) proteins (basis sets) with some of the earliest crystal structures. Several years later, larger basis sets were introduced by Chang et al. (1978) and Hennessey and Johnson (1981). Between them, the sets contained 25 proteins representing from 0% to 79% α-helix and 0% to 51% β-sheet (as measured by the assignment algorithms used in their studies); more recently, a few proteins have been added to this core of 25. There are some features of protein spectra analysis, discussed below, that make it different from other common applications of chemometric analysis methods. However, an essential property of any basis set is that the full range of possible analyte concentrations (or protein structures) must be represented if analysis accuracy is to be obtained. Thus, for protein basis sets, the need for the broadest possible range of representative structures cannot be overemphasized.

The standard basis set proteins cover a wide range of α-helix and β-sheet contents (‘concentrations’), which suggests that they constitute a good basis set. However, whether or not this is true depends on the relevance of the criteria used by assignment algorithms to evaluate protein ‘secondary structure concentrations,’ such as those of Levitt and Greer (1977) or Kabsch and Sander (1983). It is undeniably true that there is a relationship between assigned protein secondary structures and spectral band shape. However, these correlations are actually quite general in nature. For example, α-helix produces a signal around 1654 cm⁻¹ in infrared spectra, but many bands at other frequencies assigned to α-helix can be found in the literature (1648 to 1662 cm⁻¹ in ¹H₂O, 1642 to 1660 cm⁻¹ in ²H₂O; reviewed by Goormaghtigh et al. 1994). This clearly shows that protein IR spectra also reflect variations in the nature of α-helices in addition to the number of residues in helical conformations. Although such dependencies are recognized, they are not yet well understood (Nevskaya and Chirgadze 1976; Dousseau and Pézolet 1990). In CD spectra, the dependence of band shape on α-helix content is more consistent, but all-β-sheet proteins exhibit many different CD band shapes (Perczel et al. 1992). The existence of two types of CD spectra for β-rich proteins forms the basis for their classification as βI- and βII-proteins as proposed by Sreerama and Woody (2003). The source of βII-protein CD, which resembles that of unordered polypeptides, is not yet clearly understood. However, lacking alternative or complementary criteria for protein selection, other than intuition, protein secondary structure contents have been used by default.

To simplify the discussion, we will focus primarily on infrared studies of proteins in ¹H₂O, as these are most comparable to the present work. Comprehensive reviews of CD analyses have been written by Woody (1995) and Venyaminov and Yang (1996). The mathematical methods used in the development of CD spectra-based analysis of protein conformation have typically been multivariate regressions of one form or another. In 1990–91, statistical methods were introduced into the IR analysis field with the nearly simultaneous publication of several studies (Dousseau and Pézolet 1990; Kalnin et al. 1990; Lee et al. 1990; Sarver and Krueger 1991a,b). These focused primarily on mathematical methods, but also addressed other issues relevant to IR such as the data regions used, the effects of deuteration, different normalizations, and the combination of IR and CD spectra. More recent infrared studies in this genre have dealt with data types (Pancoska and Keiderling 1991; Pancoska et al. 1991; Pribic et al. 1993; Baumruk et al. 1996; Wi et al. 1998) and alternative analysis methods (Venyaminov and Vassilenko 1994; Pancoska et al. 1996), and one study briefly mentions the effects of removing proteins from a 28-protein basis set (22 folds, as defined below) in a vibrational CD study in D₂O (Pancoska et al. 1995).

Previously, fractional compositions (FCs) were the only readily available information that could be used in basis protein selection, other than intuition, and so FCs have necessarily been the primary criterion for selection. Now that the number of available crystal structures has reached the thousands, coherent protein structure classification systems have been developed. CATH (Orengo et al. 1997) and SCOP (Barton 1994; Murzin et al. 1995; Hubbard et al. 1997) are two such schemes that classify proteins based on their three-dimensional structures. The recent appearance of SCOP and CATH has made it possible to include protein fold as a criterion in basis protein selection and to evaluate how well a basis set reflects the full range of known tertiary structures.

In the following text we discuss the use of information contained in these and other crystal structure databases to construct a protein basis set that may help to answer many of the questions posed here. We identified 115 commercially available (and 5 other) proteins with solved crystal structures as potential basis set candidates. From these, a 50-protein basis set was assembled that not only covers the broadest possible range of secondary structure contents, but also the largest possible variety of tertiary structures. We demonstrate the germaneness of the approach used here with a comparison of spectra from proteins with similar secondary structure contents (FCs) but different folds.

Results and Discussion

To provide an optimal basis for spectroscopic secondary structure determination, a set of reference proteins must represent the widest possible range of structure types. Here, we use the term ‘fold’ to refer to tertiary structure, or the spatial organization of secondary structural units, that can produce strain, distortions, or other differences in the geometry of secondary structural units.

In this study, the CATH database was used as the primary tool in a search for suitable proteins. To supplement the ranges of α-helix (FCH) and β-sheet (FCE) represented, the PDB_SELECT database was also utilized. From the list of potential basis set candidates obtained in this manner, proteins were selected and their suitability for use in an experimental basis set evaluated using supplemental information obtained from the SCOP, DSSP, and SWISS-PROT databases. If a protein was found to be unusable (not pure when acquired, denatured), potential replacements were identified by returning to the CATH or SCOP databases. The resulting basis set of rationally selected proteins (RaSP) represents a wide range of different protein structures, and the spectra of these proteins exhibit a wide range of variation, some of which does not depend entirely on secondary structure FCs.

Construction of the RaSP set

Protein selection based on fold

The first step in the protein selection process was a search of the CATH database for proteins that have been crystallized, classified, and are also commercially available. Commercial availability was chosen as an important criterion for protein selection in order to make the RaSP basis set accessible to every scientist. The goal of the selection strategy was to identify representative proteins for as many different folds as possible. By using protein fold as the primary protein selection criterion, the largest possible range of structural variation has been incorporated into the set. Because of the dependence of protein spectra on subtler differences in structure, it follows that this selection strategy has produced a wide range of spectral variation as well (see below). To maximize the variety of structures represented in the RaSP set, a few proteins provided by some of our collaborators were included because they represented unique folds.

The CATH system was chosen as the primary protein selection tool for this study because its organization provided a convenient and effective way to choose proteins with differing folds. The CATH numbers for all proteins identified as potential basis set members are listed in Table 1. A CATH number is assigned to each domain in a protein, so it is possible for multidomain or multichain proteins to have several different CATH numbers or more than one domain with the same CATH number. To simplify the table, the CATH number is listed only once (not repeated) for proteins with multiple domains that have the same CATH assignments.

A brief discussion of the CATH and SCOP classifications will help illustrate their usefulness in the construction of a basis set that is representative of protein structure in general. The first level of both hierarchies is Class (C), which is based largely on the relative amount of α-helix and β-sheet in the classified proteins. Since the band shapes of IR and CD spectra depend strongly on α-helix and β-sheet content (FCH and FCE), it is immediately apparent that there must be some correlation between spectra and classification. A protein domain's position in the second level of CATH, Architecture (A), is based on its overall fold (Fig. 1 abscissa), or general arrangement of secondary structural units. The first two levels in SCOP, Class and Fold, are also listed in Table 1 to provide a more intuitive reference, since CATH numbers do not explicitly indicate α/β and α+β domains. The third level of CATH, Topology (T; Fig. 1 ordinate) further refines the classification to group protein domains in a given Architecture based on variations in the specific organization, number, and connectivity of secondary structural units. Both the CATH Architecture and Topology numbers have been assigned in such a way that they tend to increase with the complexity of a domain's fold. The fourth CATH level, Homologous Superfamily (H), groups proteins using first differences in primary sequence within a given topology, and then structural equivalence. At this level, the number of proteins within a given group is typically small, as are variations in structure, and so the Homologous Superfamily level was not used as a criterion for the protein selection process.
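The hierarchy described above can be sketched in Python as follows. This is an illustrative sketch only: it splits a dotted CATH number into its Class, Architecture, Topology, and Homologous Superfamily levels and groups domains by the first three levels, mirroring the decision not to use the H level as a selection criterion. The type and function names are hypothetical, and the four domain numbers are example inputs, not entries from Table 1.

```python
from collections import defaultdict
from typing import NamedTuple


class CathNumber(NamedTuple):
    cath_class: int     # C: Class
    architecture: int   # A: Architecture
    topology: int       # T: Topology
    superfamily: int    # H: Homologous Superfamily


def parse_cath(number: str) -> CathNumber:
    """Split a dotted 'C.A.T.H' string into its four integer levels."""
    parts = number.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four levels (C.A.T.H), got {number!r}")
    return CathNumber(*(int(p) for p in parts))


def group_by_topology(numbers):
    """Group domain numbers by (C, A, T), ignoring the H level."""
    groups = defaultdict(list)
    for number in numbers:
        c = parse_cath(number)
        groups[(c.cath_class, c.architecture, c.topology)].append(number)
    return dict(groups)


# Example inputs for illustration only.
domains = ["3.40.50.300", "3.40.50.720", "3.40.630.10", "1.10.490.10"]
groups = group_by_topology(domains)
for key, members in sorted(groups.items()):
    print(key, members)
```

Under this grouping the first two example domains fall into one Topology, so only one of them would need to be represented in a fold-based search.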

To show the extent to which the RaSP set represents the full range of possible structure variation, the entire available ‘CATH space’ is delineated in Figure 1 with solid bars, over which the RaSP proteins are plotted. The bars representing CATH space show that there are many available Topologies in some Architectures (e.g., αβα sandwich), whereas other Architectures have only one known Topology (β-clam). Both the Architecture and Topology levels of the CATH hierarchy represent large differences in structure, so their combined use for protein selection ensures a broad range of structural, and thus spectral, variation. In contrast, the smaller structural differences between proteins whose classification is identical except in their CATH Homology number would not be expected to produce large spectral differences. Because of this, it was possible to move on once a commercially available protein was identified in a given Topology.

Roughly 90 proteins were identified in the fold-based search as potential reference set candidates. Ideally, if a basis set is to represent all possible variations of protein structure, it should completely fill CATH (or SCOP) space. Not surprisingly, we found that the majority of the proteins listed in these databases are not available commercially. Despite this limitation, the proteins identified as RaSP set candidates are well distributed in CATH space, and thus we can conclude that they provide the most comprehensive sampling possible of known protein structures. Since the structural differences that cause variations in protein spectra are not completely understood, we suggest that the range of folds represented by a basis set may be the best measure of how likely it is to be applicable, in a general sense, as a reference for protein structure determination.

Protein selection based on α-helix and β-sheet contents

The secondary structure percentages, FCH and FCE, are two of the best-characterized structural properties of proteins and are the strongest, if somewhat inconsistent, determinants of spectral band shape. Therefore, protein classification schemes and secondary structure percentages are complementary, and both were used here to maximize the extent of structural variation included in the RaSP set. Figure 2 presents the proteins considered for the RaSP set as a function of their α-helix (FCH) and β-sheet (FCE) contents in ‘HE space.’ Again the full range of available structures is indicated on the graph for comparison (small points). These points show the HE values of proteins in the PDB_SELECT database, which groups proteins based on sequence homologies.

After the fold-based step in protein selection, additional potential proteins were chosen from the PDB_SELECT database (at the 55% homology level) based on their having unique FCH and FCE values compared to the proteins already selected, and of course their commercial availability. In the FC-based step, approximately 20 proteins that filled in sparsely populated regions of HE space were added as potential basis set candidates, bringing the total to ∼110. Many of the added proteins had low FCs of non-periodic structures (T, C, etc.), and fall near or above the line defined by (0.8)%H + %E = 52, or have FCH values between 40% and 65%. Special attention was given to identifying potential helical proteins at this stage because analysis of all-helix protein IR spectra, especially curve fitting, generally tends to underestimate the actual FCH. Furthermore, few examples of proteins with more than 45% α-helix have appeared in previous basis sets. In this step, it was necessary to select some proteins with ‘redundant folds’ to obtain the widest possible HE space distribution (TRO and PAB; PEP, PGN and REN; SBC and SBN; IGG and SOD; see Tables 1 and 2). Some redundant proteins were also chosen because they have appeared commonly in previous basis sets and will be used in future examinations of structure-spectrum correlations (MBN, HBN, and COL; ALA and LSZ; CTG and TGN).
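The two numerical conditions quoted above, the line (0.8)%H + %E = 52 and the 40%–65% helix band, can be written as a small Python check. This is a sketch under stated assumptions: the function names are hypothetical, the text says "near or above" the line whereas the code tests only "on or above" it, and the (%H, %E) pairs are invented test values rather than RaSP proteins.

```python
def on_or_above_line(h_pct, e_pct):
    """True if the point lies on or above the line 0.8*%H + %E = 52."""
    return 0.8 * h_pct + e_pct >= 52.0


def in_helix_band(h_pct):
    """True if the helix content lies between 40% and 65% inclusive."""
    return 40.0 <= h_pct <= 65.0


def fills_sparse_region(h_pct, e_pct):
    """True if either condition named in the text is met."""
    return on_or_above_line(h_pct, e_pct) or in_helix_band(h_pct)


# Invented (%H, %E) pairs, for illustration only.
points = [(50.0, 20.0), (10.0, 30.0), (45.0, 5.0), (5.0, 50.0)]
results = [fills_sparse_region(h, e) for h, e in points]
print(results)
```

The second invented point (10% H, 30% E) satisfies neither condition, so it would not count as filling one of the sparsely populated regions described here.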

The RaSP points plotted in Figure 2 illustrate that acceptable proteins were difficult to find in some regions of HE space. These regions were examined carefully, but few candidates could be identified, for example, with total combined α-helix and β-sheet contents of less than 35% (indicated by the line %H + %E = 35 in the figure; CAS, MTH, TIK). There were three other regions where acceptable proteins were rare. These are proteins with β-sheet contents >52%, proteins with >10% α-helix and >34% β-sheet, and proteins with 50%–70% α-helix. We were able to partially fill two of these gaps with ATX and APE, which are not available commercially.

Elimination of unsuitable proteins

Up to this point, the focus of the selection process was on maximizing potential spectral variability by finding as many commercially available proteins as possible. The next step was to optimize the crystal structure information for the proteins. This was achieved by identifying the best crystal structure for each protein and also by rejecting those proteins whose crystal structures did not meet the requirements discussed below. Where possible, rejected proteins were replaced by more suitable proteins with the same CATH number.

This reference information refinement process began with a comparison of the sources (species) of the commercial proteins and crystallized proteins (Table 2). Wherever possible, crystal structure–commercial protein pairs with the best possible sequence homologies were chosen. To accomplish this, the SCOP database, at the ‘Protein’ level of the hierarchy, was consulted to find all species from which each protein had been crystallized. Comparison with the commercial protein sources often resulted in a direct species match (i.e., an identical sequence).

Next, the sequence homology between the crystal structures and commercial proteins was evaluated, where possible, using the HSSP database (Sander and Schneider 1993; Dodge et al. 1998). Normally, proteins with 25%–30% sequence homology will have similar folds (Flores et al. 1993; Hilbert et al. 1993; Orengo et al. 1994). However, relatively small differences in primary sequence can cause variation in the spectra of proteins (Prestrelski et al. 1991). For example, bovine and human ALA have 75% sequence homology, and the same general fold; however, both their CD and IR spectra are significantly different (Keiderling et al. 1994). Therefore, those proteins with less than 85% sequence identity were rejected as potential basis set candidates. Exceptions, retained because of unique structures, were CAH (79% identity) and CNA (82%).

To further verify the correspondence of each chosen crystal structure to the commercially available protein, the SWISS-PROT database was consulted. For proteins that could be located in SWISS-PROT, the number of amino acids in the sequence was counted after removal of any propeptides, signal sequences, etc., and then compared with the number of residues in the crystal structure. Proteins with crystal structures that lacked a substantial number of residues were rejected; it was possible to set the cutoff to 5%–6% missing residues without seriously reducing the number of potential proteins.
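The two screening rules just described, the 85% sequence identity cutoff (with the CAH and CNA exceptions) and the limit on residues missing from the crystal structure, can be sketched in Python as below. The class and function names are hypothetical, the 6% value is simply the upper end of the 5%–6% range quoted in the text, and all residue counts (including those attached to CAH) are invented for illustration.

```python
from dataclasses import dataclass

IDENTITY_CUTOFF_PCT = 85.0   # from the text
MISSING_CUTOFF_PCT = 6.0     # text quotes 5%-6%; the upper value is assumed
UNIQUE_FOLD_EXCEPTIONS = {"CAH", "CNA"}  # retained despite lower identity


@dataclass
class Candidate:
    code: str
    identity_pct: float     # crystal structure vs. commercial protein
    mature_length: int      # sequence length after removing propeptides etc.
    crystal_residues: int   # residues present in the crystal structure


def missing_pct(c):
    """Percentage of the mature sequence absent from the crystal structure."""
    return 100.0 * (c.mature_length - c.crystal_residues) / c.mature_length


def passes_screen(c):
    identity_ok = (c.identity_pct >= IDENTITY_CUTOFF_PCT
                   or c.code in UNIQUE_FOLD_EXCEPTIONS)
    complete_ok = missing_pct(c) <= MISSING_CUTOFF_PCT
    return identity_ok and complete_ok


# Residue counts are invented; only the CAH identity (79%) is from the text.
candidates = [
    Candidate("CAH", 79.0, 260, 258),   # exception: unique structure
    Candidate("XX1", 80.0, 150, 150),   # hypothetical: identity too low
    Candidate("XX2", 100.0, 200, 180),  # hypothetical: 10% residues missing
    Candidate("XX3", 100.0, 300, 295),  # hypothetical: passes both rules
]
kept = [c.code for c in candidates if passes_screen(c)]
print(kept)
```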

Optimizing crystal structure quality

It was possible to go one step further in the optimization of reference information by analyzing the quality of protein X-ray crystal structures. There are actually several factors that can cause discrepancies between crystal structures of the same protein. Such variations are often artifacts from the process of solving and refining the structure, but can sometimes reflect real differences in the structure of the protein in crystals obtained from different conditions. To illustrate, the DSSP assignments of 37 complete, wild-type crystal structures of lysozyme (LSZ) were examined with the STRUCTAB program. The results listed in Table 3 show that the FC values for these crystal structures cover a considerable range. Thus it is obvious that if one of these structures can be identified as being of higher quality than the others, it should be used as the reference in structure analysis.

The selection of the best crystal structure for each protein was performed by consulting a database of results from the WHAT_CHECK program, which is kept at the European Molecular Biology Laboratory (EMBL). WHAT_CHECK compares many aspects of an individual protein crystal structure with a reference set of very high-quality structures and derives a series of RMS Z scores that reflect the quality of different aspects of the structure in question. Although crystal structure quality tends to increase with finer resolution, the correlation is similar to the dependence of protein spectra on structure. That is to say that individual cases may not always follow the expected trend. Therefore, instead of simply choosing the highest resolution structure for a given protein, the WHAT_CHECK Ramachandran plot and backbone conformation RMS Z scores, as well as the number of unsatisfied buried H-bonds in each crystal structure, were compared. The structure with the best scores was selected. These criteria were used because they should be good indicators of the correspondence between crystal structure and spectra. The crystal packing RMS Z scores were considered when one protein had two or more crystal structures with similar scores for the other values. The PDB identification code and scores of the selected crystal structure for each protein are listed in Table 2, and the FC values are given in Table 1. Proteins with very low WHAT_CHECK scores were rejected.
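A rough Python sketch of this comparison is given below. The text does not state how the individual scores were weighed against one another, so the lexicographic ordering used here is purely an assumption, as are the conventions that higher Z scores and fewer unsatisfied buried H-bonds indicate better structures. The identifiers and score values are placeholders, not real PDB entries or WHAT_CHECK results.

```python
from dataclasses import dataclass


@dataclass
class StructureScores:
    pdb_id: str
    ramachandran_z: float
    backbone_z: float
    unsatisfied_hbonds: int
    packing_z: float


def sort_key(s):
    # Assumed ordering: Ramachandran Z, then backbone Z, then fewest
    # unsatisfied buried H-bonds; packing Z only breaks remaining ties.
    return (s.ramachandran_z, s.backbone_z, -s.unsatisfied_hbonds, s.packing_z)


def best_structure(structures):
    """Return the entry that ranks highest under the assumed ordering."""
    if not structures:
        raise ValueError("no structures to compare")
    return max(structures, key=sort_key)


# Placeholder identifiers and scores, for illustration only.
structures = [
    StructureScores("xxx1", -0.5, -0.3, 4, -0.2),
    StructureScores("xxx2", -0.5, -0.3, 4, 0.1),
    StructureScores("xxx3", -1.8, -1.0, 9, 0.3),
]
best = best_structure(structures)
print(best.pdb_id)
```

Here the first two placeholder entries tie on every primary score, so the packing score decides between them, which matches the tie-breaking role described for it above.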

Protein acquisition and purity testing

The rejection of unsuitable proteins from the set left 106 potential basis set candidates. Finally, proteins were selected from these for acquisition in groups of 10–20, based on their cost, advertised purity, and position in the CATH and HE space distributions. SDS-PAGE was used to screen acquired proteins for purity. Estimates of protein purity are included in Table 2. Many proteins proved to be less than ∼85% pure, and so were discarded. Overall, 22 out of a total of 72 acquired proteins were discarded because of impurity or other technical problems. Where possible, discarded proteins were replaced by acceptable substitutes that were found by returning to CATH (∼10 new potential proteins were found at this stage). The effects of purity on analysis results will be discussed elsewhere.

After the selection, acquisition, and elimination process, there remained a set of 50 proteins that had been acquired and met all selection criteria or were considered to be reasonable compromises.

Description of the RaSP50 basis set

We will refer to the final set of basis proteins as the RaSP50 set from this point onward. The proteins in the RaSP50 set range from 6 (INS) to 700 (IGG) kD in size, and contain from 1 to 7 (ATX) chains and 1 to 6 (PAH) domain folds.

The relevance of the RaSP50 set as a general tool for experimental studies is demonstrated by the range of structures represented as compared to known protein structures. The RaSP50 set contains 42 unique protein types which represent all four CATH classes, 18 out of 31 possible Architectures, and at least 60 of the 482 known Topologies. These 60 CATH Topologies contain roughly 6200 of the 10,344 domain entries included in the CATH database at the time the basis set was constructed. To look at the comprehensive nature of the RaSP50 set from a different perspective, consider the final column of Table 1, which lists the superfold family membership of the proteins (Orengo et al. 1994). Currently nine protein superfold families have been identified, which account for 46% of all known nonhomologous protein folds (note that many proteins do not have any superfold domains). The RaSP50 set contains representatives of eight of these superfolds. It was not possible to include a representative of the ninth, split αβ sandwich.
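The coverage figures quoted above can be turned into percentages with a few lines of Python. The counts are taken directly from the text ("at least 60" Topologies and "roughly 6200" domain entries are treated as exact for the arithmetic); the variable and function names are illustrative.

```python
# (represented in RaSP50, total known), as quoted in the text.
coverage = {
    "Architectures": (18, 31),
    "Topologies": (60, 482),
    "Domain entries": (6200, 10344),
}


def percent(part, whole):
    return 100.0 * part / whole


percentages = {name: percent(*pair) for name, pair in coverage.items()}
for name, value in percentages.items():
    print(f"{name}: {value:.1f}%")
# Architectures: 58.1%
# Topologies: 12.4%
# Domain entries: 59.9%
```

The arithmetic makes the point of the paragraph explicit: about 12% of the known Topologies account for roughly 60% of the domain entries in the database.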

To more clearly illustrate some of the structural variation in the RaSP50 set, fragments of several protein crystal structures are diagramed in Figure 3 (Kraulis 1991). In the top row of the figure, the β-strands of four of the RaSP50 proteins are drawn. The structure of avidin (AVI) is a simple β-barrel with eight antiparallel β-strands connected primarily by short loops (omitted for clarity). Several of the β-class proteins included in the RaSP set, such as AVI, have predominantly regular and undistorted strands. The β-sheets in CNA are an example of sharply curved β-strands and also illustrate interactions that can give rise to distortion of common structures. BTE is an example of a much simpler strand system, and UOX illustrates that many different variations and distortions can coexist in a single protein. The middle row of the figure gives three examples of αβ proteins with different tertiary structures, including sheet around helix (UBQ), helix on sheet faces (RNA), and helix around sheet (αβα sandwich, PGK). The one protein with no α-helix or β-sheet included in the RaSP50 set, MTH, also appears here. The bottom row includes helical portions of four proteins. Many different variations of α-helix types are included in the RaSP50 set, including long regular helices (FTN), medium-length (MBA), and short helices (INS). The four selected helices from LOX, which contains 45 helices overall, illustrate some of the helix distortions present in the RaSP50 set, including kinks (bottom two), gradual bends (third from bottom), and sharp bends (top helix) that may all contribute to variations in spectra.

CATH space distributions

The CATH and SCOP classifications are based on the spatial arrangement and connectivity of units of secondary structure rather than specific distortions present in individual groups of residues. There is, as of yet, no classification system that specifically incorporates deviations of secondary structures from ideality. Nevertheless, structural distortions are very much a product of the contacts and connections between units of secondary structure. Thus, the CATH database provided the best way for us to maximize the variation in these features and to produce, as a result, the largest possible range of spectral variation. By using CATH, it was possible to include long, short, classical, and distorted helices and β-strands; straight, curved, and kinked α-helices; twisted, flat, and curved β-sheets; β-bulges; many different modes of packing (topologies); and many different combinations of helix and sheet FCs.

Several suitable proteins were found in all of the CATH Architectures with more than 10 Topologies, except 1.30. Of the three Architectures with more than 50 Topologies, the αβ sandwich (CATH 3.30) was most challenging. The three RaSP50 3.30 proteins are all complex and have multiple domains (OVA, PAH, UOX). The distribution of the RaSP50 set members in the αβα sandwich Architecture (CATH 3.40; 24 identified; 14 acquired; five discarded) is more even than that of the αβ sandwich, and ranges from 3.40.50 to 3.40.630. The 3.40 Architecture contains the αβ doubly wound superfold family, whereas 3.30 includes proteins of the split αβ sandwich superfold, which is the only superfold that could not be represented in the RaSP50 set.

The third large CATH Architecture, all-α/non-bundle (CATH 1.10), deserves special attention here. In all, 29 1.10 proteins representing ≥29 different Topologies were identified (some of these proteins have not yet been assigned CATH numbers), with 16 proteins (15 Topologies) appearing in the final RaSP50 set. Under the SCOP system, 14 RaSP domains are of the all α-helix class, and 11 of these are in the RaSP50 set. The difference in the number of proteins with CATH and SCOP all α-helix classifications arises from the different ways that protein domains are identified in these two systems: Several domains with α or α+β assignments in the SCOP database are treated as multiple domains in CATH, with one or more classified as all α-helix (BLM, CAB, CAT, DNA, PAH, PAP, POX, THR). Regardless of the classification system used, a wide variety of different helical topologies with classical and exceptional helix geometries are represented in the RaSP50 set, including examples of both the α up-down and globin superfold families.

HE space distributions

We now turn our focus to α-helix and β-sheet fractional compositions (FCs), which were the secondary protein selection criterion used in the construction of the RaSP set. Figure 2 compares the fractional compositions, FCH and FCE, of the RaSP proteins with ∼900 PDB_SELECT members (35% homology cutoff). The points in the figure representing the PDB_SELECT proteins occupy a relatively well defined region of HE space, with the majority between the lines %H + %E = 35% and %H + %E = 65%, though several all-α proteins have more than 65% α-helix. Nearly all RaSP set members also fall within these limits, but some portions of HE space are populated more densely than others. This point about the relationship between α and β structures was already reported by Pancoska et al. (1995). Considering again the PDB_SELECT proteins, a region of high protein density is bounded by the %H + %E = 35%−65% limits and the lines %E−%H = 5% and %E = 10% (region 1). The highest density of RaSP proteins also lies in region 1 and includes 43 proteins. If the PDB_SELECT database with a 55% homology cutoff is used (data not shown), smaller clusters of proteins are apparent within the limits %H = 10%–20% and %E = 30%–40% (region 2) and at %H = 0%–12% and %E = 43%–50% (region 3). These regions contain four and seven RaSP50 proteins, respectively. Overall, the distribution of the RaSP50 proteins in HE space parallels the natural dispersion. This was achieved even though α-helix and β-sheet composition was not the primary criterion for protein selection.
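The three HE-space regions defined above can be expressed as a small Python classifier. This is a sketch under stated assumptions: the text names the bounding lines of region 1 but not which side of each line belongs to the region, so the code assumes %E − %H ≤ 5 and %E ≥ 10, which is the only combination for which all four lines actually bound a closed area. The function name and the test points are invented for illustration.

```python
def he_region(h_pct, e_pct):
    """Assign a (%H, %E) point to region 1, 2, or 3, or 'outside'."""
    # Region 2: %H = 10-20 and %E = 30-40.
    if 10.0 <= h_pct <= 20.0 and 30.0 <= e_pct <= 40.0:
        return "region 2"
    # Region 3: %H = 0-12 and %E = 43-50.
    if 0.0 <= h_pct <= 12.0 and 43.0 <= e_pct <= 50.0:
        return "region 3"
    # Region 1: between %H + %E = 35 and 65, with the assumed sides
    # %E - %H <= 5 and %E >= 10 (see the note above).
    total = h_pct + e_pct
    if 35.0 <= total <= 65.0 and e_pct - h_pct <= 5.0 and e_pct >= 10.0:
        return "region 1"
    return "outside"


# Invented (%H, %E) points, for illustration only.
points = [(35.0, 20.0), (15.0, 35.0), (5.0, 45.0), (70.0, 5.0)]
labels = [he_region(h, e) for h, e in points]
print(labels)
```

With these definitions the three regions do not overlap, because every point in regions 2 and 3 has %E − %H of at least 10.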

Structure frequency distributions

We have shown that the distribution of protein folds and fractional secondary structure compositions in the RaSP50 set covers much of the same range as crystallized proteins. There are, in addition to the CATH and HE spaces, other relevant comparisons that will further establish the general nature of the RaSP50 set. Figure 4 presents the frequencies of different structures that occur in the RaSP50 and PDB_SELECT proteins as modified histograms. It is immediately clear from the figure that the selection processes used in constructing the RaSP50 set have produced distributions that are similar to the patterns in the population of known protein structures. The FCH histogram (Fig. 4H) shows that the largest number of proteins have <10% α-helix in both the RaSP50 and PDB_SELECT (35%) sets. Both sets also have peaks at 30%–40% helix, which is consistent with the high density of proteins in region 1 of HE space (Fig. 2). The β-sheet distribution for the PDB proteins also reflects the HE region 1 population density, but the curve for the RaSP50 set is slightly different because of our focus on proteins with high helix content (and therefore low FCE). This curve also emphasizes the relatively few acceptable proteins found with 20%–30% β-sheet. The RaSP50 and PDB_SELECT curves are highly similar in the FCT and FCC plots, as they are in the structure size distributions on the right side of the figure. These distributions provide further evidence that the RaSP50 basis set has many of the same essential properties as the population of proteins of known structure.

Importantly, the maximum of the FCC histogram in the RaSP50 proteins (and in the PDB) lies mainly in the 10%–30% range, most FCT values are between 5% and 17%, and most RaSP50 proteins have less than 8% 3₁₀ helix (not shown). From this lack of variation we can conclude that it should be more difficult to accurately define a correlation between band shape and these structures using statistical analysis methods.

The variability of spectra from proteins with similar FCs but different folds

To conclude our examination of the set, we briefly present several examples of RaSP50 protein spectra to illustrate the extent of this variation, and to reveal the consequences of the customary HE-based protein selection strategy. However, before continuing, several points must be made. The first is that all of the spectra of the RaSP50 set proteins have the general character that is ‘expected’ based on their crystal structures. The next important point is that variations in protein spectra may also reflect the contributions of side chains to the spectra. For CD, the exact nature of aromatic side chain contributions can depend strongly on their local environment (Grishina and Woody 1994; Woody 1994; Woody and Dunker 1996), but most IR-active side chains are carboxylates and amines, which are found mainly at protein surfaces. Their signals should not normally be strongly perturbed by neighboring residues, and can therefore be modeled with some accuracy. There has been an active but unpublished discussion among protein IR spectroscopists about the effect of side chain bands on analysis results, so to ensure that the differences shown here result from protein secondary structure, side chain bands have been subtracted from all of the IR spectra shown. Such an approach was suggested by Venyaminov in previous work (Venyaminov and Kalnin 1990).

Our brief survey begins with proteins of highly α-helical character. In order to merge the information contained in the CD and IR spectra for future investigations, hybrid CD-IR spectra were built. At the top of Figure 5, the pair of hybrid CD+IR spectra labeled A were collected from HBN and FTN, which have 68.9% and 71.3% H (α-helix) respectively, as assigned by DSSP. The helices of both of these proteins are illustrated in Figure 3; they both have no β-sheet (E) and nearly identical turn (T) and Σ other structure FCs. Their IR spectra are similar in shape, but there is an offset apparent between their infrared amide I bands (1656 and 1653 cm−1, respectively) as well as a reproducible intensity difference in their amide II bands. It should be kept in mind that this is just one example; FTN has predominantly long helices (22–28 residues) that pack against each other at roughly 20° angles. In contrast, the average length of the helices in HBN is 14 residues, and several pack at an ∼50° angle. The shape of the CD spectra of these two proteins is largely the same, but the FTN spectrum has a markedly smaller intensity. The effect of the number of residues involved in the α-helical structures was suggested by Nevskaya and Chirgadze (1976), who showed that the frequency of the main amide I component rises as helix length decreases. The CD intensities of CSA and TRO (Fig. 5B) are much more similar, but their IR spectra have distinctly different shapes. These two proteins again have very similar secondary structure FCs (59.6% and 62.3% H, 10.8% and 8.6% E, 2.3% and 3.7% T, respectively). CSA is a relatively large protein with 16 helices of ≤15 residues and four 20-residue helices; TRO has a single long helix (33 residues) around which seven shorter helices (7–12 residues) are packed. Despite the similar ratio of long and short helices in each protein, their amide I maxima are offset by 5 cm−1 (1656 versus 1651 cm−1, respectively), and have substantially different widths.

Variations between spectra of proteins with similar secondary structure FCs are not limited to helical proteins, as is illustrated by the spectra pairs C and D in Figure 5. LOX and PGK have similar FCs (∼34% H, ∼12% E) but completely different folds (Table 1, Fig. 3). They have inflections in their IR amide I bands at 1652 and 1636–1640 cm−1 with different relative intensities that would normally suggest that PGK has a larger β-sheet content. Their CD spectra have different shapes, with the intensity of the PGK spectrum suggesting a larger α-helix content. The spectral differences that result from fold are even more pronounced in the smaller proteins RNA and UBQ (∼33% H, ∼17% E, see also Fig. 3). Both spectral data regions would normally lead to the conclusion that UBQ has a substantially higher FCE.

It is commonly accepted that CD is most effective when used for determination of the α-helix content in proteins, whereas IR is usually considered to be more accurate, that is, more consistent, for the determination of FCE. If we now examine proteins with high β-sheet content, we see that both IR and CD spectra can be substantially different for β-class proteins. The first such example, shown in Figure 5E, is a comparison of the spectra of BTE and CNA. BTE is a small protein with five β-strands (Fig. 3), and CNA contains a large, bent β-sandwich, but they both have 44% β-sheet. Their CD spectra have the expected low intensity compared to α-helical proteins, but their shapes are completely different in the 210–240 nm region. Although the contribution of aromatic residues to the CD spectra may be invoked to explain this difference, especially the positive BTE signal at 228 nm, their IR spectra are also dissimilar. The amide I maxima of these proteins are offset by 12 cm−1 (1647 versus 1635 cm−1), although the higher turn content of BTE (16% versus 11% T for CNA) could produce a small portion of the spectral difference above 1660 cm−1. The β-barrel protein AVI and the well known β-sandwich IGG provide our final examples. These both have β-sheet contents close to 52%, which is exceeded by few proteins in either the RaSP set or the PDB. The difference in their CD spectra parallels the previous example, and their IR spectra match quite well above 1660 cm−1, but at this point there is an inflection in the IR spectrum of AVI which is not present in the IGG spectrum. The AVI peak is significantly broader than that of IGG, and the shapes of their amide II bands are not at all similar.

If secondary structure content had been the sole criterion used for the selection of RaSP set proteins, only one of each pair of spectra listed in the preceding paragraphs would have been included. The discussion and spectra above clearly reinforce our assertion that although the correlation between FC and spectral band shape does follow a certain trend, many variations occur as well.


The search for an accurate and rapid method for determining protein structure from optical spectra has continued since protein secondary structure was discovered. There have been many attempts to capture the correlation between protein structure and IR or CD band shape using a wide variety of approaches. The appearance of multivariate statistical methods in the protein structure field originally looked promising because these methods are highly effective for the characterization of simple chemical systems. Unfortunately their performance has not been consistently reliable with proteins. Many groups have attempted to improve the general accuracy of secondary structure determinations by developing new mathematical methods, but the results of the present study suggest that the choice of basis proteins should also play a major role in the overall reliability of analysis.

Despite the fold-dependent spectral variations illustrated above, the structure-spectrum relationship in the basis sets of previous studies has been strong enough to allow α-helix and β-sheet contents to be determined to within 4%–12% (RMS error). However, it should be kept in mind that these numbers may be predominantly a measure of the internal consistency of the spectra included in each basis set, and may strongly reflect the extent, or lack, of variation of the type shown above that is present in the spectra of each basis set. Because the relationship between secondary structure “concentrations” (FCs) and spectral characteristics is complex, it is important that a basis set contain examples of spectral variations so that sufficient information is incorporated into a calibration. This information will then be available to be called upon in the analysis of a protein of unknown structure.

Since the 50 members of this rationally selected protein basis set (RaSP50) represent the widest possible range of protein structure FCs and folds, its use should facilitate new progress in the development of spectroscopic protein structure analysis or other methods which require an experimentally accessible set of proteins that are representative of known protein structures.

Materials and methods

The selection of proteins

Since the explanation and justification of the methods used for identifying and accepting or rejecting basis proteins are presented in the Discussion, only a summary of the tools used is provided here.

Protein selection began with a search for proteins with crystal structures available in the CATH crystal structure database (version 1.0) that were also commercially available from the Sigma or Fluka biochemical companies. The search was not restricted to single-domain proteins. CATH (http://www.biochem.ucl.ac.uk/bsm/cath_new) and SCOP (http://scop.mrc-lmb.cam.ac.uk/scop) are crystal structure databases that are organized by three-dimensional structure. Identification data for all proteins located in this search are listed in Table 2, and their structural parameters are given in Table 1. If one or two proteins with the same Class (C), Architecture (A), and Topology (T) numbers (i.e., the same fold, see Discussion) were found, the search in that C.A.T. level was terminated. Some proteins with similar folds that have been widely used in previous studies were, however, selected as well. If a protein was purchased and found to be unsuitable for inclusion in the basis set (impure or denatured), the CATH database was again consulted to find a substitute with a similar fold. The PDB_SELECT database (Hobohm et al. 1992; Hobohm and Sander 1994) was used as an additional source for identifying potential basis proteins. PDB_SELECT is a database of protein crystal structures selected to have a low level of sequence homology, and is organized in several levels with increasing homology cutoffs. The Brookhaven Protein Data Bank identification codes (PDB ID) for proteins with unique α-helix and β-sheet contents were taken from the PDB_SELECT listing at the 35% sequence homology level, and then used as the starting point in searches of the CATH or SCOP databases for a commercially available protein with the same fold.
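The fold-limited search described above can be illustrated with a short Python sketch. This is an assumption-laden simplification and not the procedure actually used: the candidate codes and CATH numbers below are hypothetical, the function names are invented for illustration, and the exceptions mentioned in the text (widely used proteins, substitutes for rejected preparations) are not modeled.

```python
from collections import defaultdict


def fold_key(cath_number):
    """Return the (C, A, T) part of a CATH number such as '3.30.456.10'."""
    return tuple(cath_number.split(".")[:3])


def select_by_fold(candidates, max_per_fold=2):
    """Greedy pass over (code, CATH number) pairs.

    A candidate is kept until its C.A.T. level already holds
    max_per_fold proteins; later candidates with that fold are skipped.
    """
    counts = defaultdict(int)
    selected = []
    for code, cath_number in candidates:
        key = fold_key(cath_number)
        if counts[key] < max_per_fold:
            counts[key] += 1
            selected.append(code)
    return selected


# Hypothetical candidates: three share one C.A.T. level, one has another.
candidates = [
    ("P1", "1.10.1.1"),
    ("P2", "1.10.1.2"),
    ("P3", "1.10.1.3"),
    ("P4", "3.30.5.1"),
]
print(select_by_fold(candidates))
```

With a limit of two proteins per fold, the third candidate sharing the first C.A.T. level is skipped while the candidate with a different fold is retained.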

Crystal structure quality and sequence homology

To maximize the accuracy of analyses performed using spectra from the basis set, care was taken to ensure that the crystal structures used here corresponded as closely as possible to the actual structures of the proteins purchased (see Discussion). After the initial selection of potential basis proteins, the SCOP database was used to find the closest possible species match between the commercially available and crystallized form of each protein. The HSSP database was then used to compare the sequence homology of the crystallized and commercially available proteins. The SWISS-PROT database (Bairoch and Boeckmann 1991; Bairoch and Apweiler 1997; Bairoch and Apweiler 1998) was used to determine the actual number of amino acids in each protein. This number was compared with the number of residues in the crystal structure. Only the sequence of the mature protein was considered: Propeptides and signal sequences listed in SWISS-PROT were eliminated before the comparison.

The CATH database includes only crystal structures with a resolution of 3 Å or better, and thus no lower-resolution structures were considered. A database of output from the WHAT_CHECK crystal structure validation program (Hooft et al. 1996; accessible through the SCOP/PDB3D access pages) was used to evaluate the quality of the potential protein crystal structures and to choose a ‘best’ crystal structure for each protein. The main criteria were the number of unsatisfied buried hydrogen bonds and the RMS Z scores for the backbone conformation and Ramachandran plots. These quantities are described in the WHAT_CHECK on-line documentation. RMS Z scores are generated by WHAT_CHECK for different properties of a crystal structure by comparing the structure under question with a set of known high-quality crystal structures. If the crystal structure being examined is of higher quality than the reference structures, its RMS Z score will be positive. An RMS Z score is ‘poor’ if it is less than −3 (three standard deviations worse than the distribution of the reference structures), and ‘bad’ if less than −4. The RMS Z scores for the selected crystal structure of each protein are listed in Table 2.
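The RMS Z score thresholds given above amount to a simple three-way classification, sketched below in Python. The function name and the label "acceptable" for scores of −3 or higher are illustrative choices and do not come from WHAT_CHECK itself.

```python
def classify_rms_z(z):
    """Label an RMS Z score using the thresholds quoted in the text."""
    if z < -4.0:
        return "bad"         # more than four standard deviations worse
    if z < -3.0:
        return "poor"        # between three and four standard deviations worse
    return "acceptable"      # hypothetical label for all other scores


# Ramachandran RMS Z scores listed in Table 2
print(classify_rms_z(-5.209))  # PGK
print(classify_rms_z(-3.300))  # CYC
print(classify_rms_z(-1.336))  # ADH
```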

Protein purity

Proteins were selected for acquisition from the list of potential basis set candidates based on their fold, secondary structure, cost, and the advertised purity of the commercial products. Commercial preparations known to be impure (e.g., stabilized with BSA) were rejected. The purity of acquired proteins was examined with SDS-PAGE and Coomassie blue staining. Some contaminated proteins were successfully purified using size-exclusion chromatography with a 30 × 1 cm Sephadex G-100 column (DPR, TMT, and UOX; see Table 2 for protein identification codes).

The processing of spectral and crystal structure information

Several computer programs were written in Array Basic (Galactic Industries, http://www.galactic.com) to process both collected spectra and reference information. Software developed for this study includes programs designed to read and tabulate various data from a series of structure assignment program output files. The assignment program DSSP (Kabsch and Sander 1983) was used for all secondary structure data presented here. Residues that were missing from the crystal structures were given an irregular structure (C) assignment. Several output files were generated, including a summary of the percentage of residues assigned to each structure type, the amino acid content, and the calculated extinction coefficients (Gill and von Hippel 1989) for each DSSP file read. The second program generates synthetic infrared side chain spectra based on the known amino acid composition of a protein and subtracts them from the protein spectrum (Venyaminov and Kalnin 1990; Goormaghtigh et al. 1996; Barth 2000). The final program was designed to combine IR and CD spectra into a single array. This program can normalize and baseline-correct IR spectra and adjust the relative scaling of CD spectra so that they fall in a similar intensity range as the IR spectra. The IR spectra presented here have been normalized to the same intensity, and CD spectra have all been converted to mean residue ellipticity (deg cm2 dmole−1) and then scaled by a constant.
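The tabulation of fractional compositions from DSSP assignments can be sketched as follows. The original programs were written in Array Basic and are not reproduced here; this Python fragment is only an illustration, and it assumes that the per-residue DSSP states have already been extracted into a single string (one character per residue, with a blank for unassigned residues). The function name and arguments are hypothetical.

```python
from collections import Counter


def fractional_composition(dssp_states, n_missing=0):
    """Percent of residues in each class from a per-residue DSSP string.

    H, E, T, and G are counted separately; every other state (blank,
    B, S, I) is pooled as 'Remainder'. Residues missing from the
    crystal structure are added to the remainder, following the
    convention described in the text.
    """
    total = len(dssp_states) + n_missing
    if total == 0:
        raise ValueError("no residues to tabulate")
    counts = Counter()
    for state in dssp_states:
        if state in "HETG":
            counts[state] += 1
        else:
            counts["Remainder"] += 1
    counts["Remainder"] += n_missing
    classes = ("H", "E", "T", "G", "Remainder")
    return {name: 100.0 * counts[name] / total for name in classes}


# Ten assigned residues: 4 H, 2 E, 1 unassigned, 2 T, 1 G
print(fractional_composition("HHHHEE TTG"))
```

Counting missing residues in the denominator lowers every regular-structure percentage, which is why the completeness of each crystal structure matters for the reference values.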

Protein solutions

Proteins that were purchased as lyophilized powders were dissolved directly in 2 mM HEPES (1H2O) pH 7.2, 0.1% NaN3. For ammonium sulfate suspensions, the protein was first pelleted by centrifugation, and the excess (NH4)2SO4 solution was removed before dissolving the pellet in buffer. The initial protein solutions were made at a concentration of 4% (w/w) taking into consideration the mass of buffer salts in lyophilized powders, if present. Small molecules from the commercial preparations were removed by extensive dialysis against HEPES buffer (2 mM, pH 7.2, 0.02% NaN3) at 4°C, or by passing the sample through a 0.7 × 4 cm Sephadex G-25 (Pharmacia) size-exclusion centrifuge column equilibrated with this buffer. The desalting was repeated for proteins with (NH4)2SO4 or high concentrations of other salts. The high NaN3 concentration in the initial solution allowed the effectiveness of desalting to be verified for each sample because the N3− ion has a characteristic IR band at 2048 cm−1. For the proteins IGG, LCL, and ADH, the solutions were clarified by centrifugation before use. Stock solutions were made by adjusting the concentration of the desalted proteins to 3% by the addition of buffer or by concentration with a Microcon 3 or 10 (Amicon) where necessary. Stock solutions were used directly for transmission IR measurements. For CD measurements, the stock solution was diluted to ∼0.01% with 2 mM HEPES (1H2O) pH 7.2, without NaN3 because NaN3 absorbs strongly at low wavelengths. Exceptions to the general procedure are as follows: CNA was maintained at pH 5.2 during all manipulations. INS was dissolved at pH > 10, and the KOH solution was immediately exchanged for the standard HEPES buffer pH 7.2 with a Sephadex G-25 centrifuge column (2×); the final pH was 7.2.

CD spectroscopy

CD spectra were recorded in a 1-mm cell on a JASCO J-710 spectrometer (calibrated with CSA) and constantly purged with N2 at 5 l/min. Each spectrum was the accumulation of eight scans with a 1-nm slit width, a time constant of 0.5 sec, and a scan rate of 50 nm/min, for a nominal resolution of 1.7 nm; total collection time was 13 min. The protein concentrations in filtered solutions were adjusted to give a detector voltage of 495±15 V at 185 nm, which provided a good noise level and avoided flattening of the spectra. The absorbance spectra of the samples from 185–260 nm were obtained simultaneously and examined using the JASCO software. After background subtraction, the CD samples typically had an absorbance of ∼0.7 AU at 192 nm, and gave raw CD intensities larger than −5 mdeg in the 200–230 nm region. For analysis, background-corrected spectra were converted to mean residue ellipticity (deg dmole−1 cm2) using a concentration determined from the absorbance at 205 nm. They were then scaled by a constant (0.0015) to provide intensities similar to those of the processed infrared spectra. The protein extinction coefficient at 205 nm was calculated based on the number of peptide bonds in the protein and an assumed peptide bond ε205 of 5167 l mole−1 cm−1, which is based on a combination of previously published values (Hennessey Jr. and Johnson Jr. 1981; Scopes 1987). With the instrument used here, mean residue ellipticities calculated using A205 were more consistent than those based on A192, which has been used in some other studies. The accuracy of this value was confirmed by comparison with concentrations determined from A280 for proteins with known ε280 values (Gill and von Hippel 1989), and also by comparison with published spectra when possible.
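The conversion from raw ellipticity to scaled mean residue ellipticity can be sketched in Python as below. This is an illustration under stated assumptions rather than the code used in the study: it assumes a single polypeptide chain (peptide bonds = residues − 1), the same 1-mm path length for the absorbance and CD measurements, and the standard relation [θ] = θ(mdeg) / (10 · c · l) with c the molar residue concentration and l the path length in cm. Function and parameter names are hypothetical.

```python
EPS205_PEPTIDE_BOND = 5167.0  # l mole^-1 cm^-1, value quoted in the text


def mean_residue_ellipticity(theta_mdeg, a205, n_residues, path_cm=0.1):
    """Mean residue ellipticity (deg cm2 dmole^-1) from A205.

    Assumes one chain, so the number of peptide bonds is n_residues - 1.
    """
    n_peptide_bonds = n_residues - 1
    eps_protein = n_peptide_bonds * EPS205_PEPTIDE_BOND
    protein_molar = a205 / (eps_protein * path_cm)   # mol protein per liter
    residue_molar = protein_molar * n_residues       # mol residues per liter
    return theta_mdeg / (10.0 * residue_molar * path_cm)


def scale_for_hybrid(mre, factor=0.0015):
    """Scale an ellipticity by the constant used to match IR intensities."""
    return mre * factor


# Hypothetical 101-residue protein with A205 = 0.5167 and -10.1 mdeg
mre = mean_residue_ellipticity(-10.1, 0.5167, 101)
print(mre, scale_for_hybrid(mre))
```

Because the same path length enters both the concentration and the ellipticity terms, it cancels in the final value; it is kept as a parameter only to make the intermediate concentrations explicit.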

Infrared spectroscopy

Infrared spectra were collected on a dry-air-purged Bruker IFS-55 spectrometer with a liquid N2-cooled MCT detector; 512 scans were accumulated for each spectrum at a resolution of 2 cm−1. Spectra were collected using the 3% protein stock solutions (1H2O) placed between CaF2 windows separated with a 5-μm Teflon spacer. The buffer spectrum was subtracted with the help of a software package written in Array Basic by K. Oberg. This program was designed for the iterative optimization of various subtractions that are commonly encountered in protein infrared spectroscopy. Here, the subtraction scaling factor was adjusted so that the slopes of the baselines from 1990–1900 and 1850–1740 cm−1 were the same (Powell et al. 1986). The buffer spectrum subtraction scaling factors determined typically ranged from 0.98 to 1.01. Water vapor signal was removed using the area of the vapor peaks at 1717 or 1772 cm−1 to determine the subtraction scaling factor. The intensities of the protein spectra collected in this study were in the range of 10 to 55 mAU, with a typical spectrum having an intensity of 35–40 mAU. The RMS noise level from 2200–2100 cm−1 was 9.8×10−6 AU, giving signal-to-noise ratios of 1000–5600, with an average value around 3700. A more conservative estimate, based on the RMS noise level in the amide I region in the presence of buffer (1.8×10−5 AU), gives signal-to-noise ratios of 550–3000, with an average value around 2000.
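The slope-matching criterion for buffer subtraction can be illustrated with a short Python sketch. Note that this is not the iterative Array Basic program described above: because a least-squares slope is linear in the spectrum, the factor that equalizes the two baseline slopes can be written in closed form, and that shortcut is what is shown here. All function names are hypothetical.

```python
def slope(x, y):
    """Least-squares slope of y against x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def region(wavenumbers, values, low, high):
    """Return the points with low <= wavenumber <= high."""
    pairs = [(w, v) for w, v in zip(wavenumbers, values) if low <= w <= high]
    return [p[0] for p in pairs], [p[1] for p in pairs]


def buffer_scale_factor(wavenumbers, sample, buffer,
                        region_a=(1900.0, 1990.0),
                        region_b=(1740.0, 1850.0)):
    """Factor k for which (sample - k * buffer) has equal slopes in both regions.

    Solves slope_a(sample) - k * slope_a(buffer)
         = slope_b(sample) - k * slope_b(buffer) for k.
    """
    sa_sample = slope(*region(wavenumbers, sample, *region_a))
    sb_sample = slope(*region(wavenumbers, sample, *region_b))
    sa_buffer = slope(*region(wavenumbers, buffer, *region_a))
    sb_buffer = slope(*region(wavenumbers, buffer, *region_b))
    return (sa_sample - sb_sample) / (sa_buffer - sb_buffer)
```

The closed form requires that the buffer spectrum have different slopes in the two windows; if it does not, the criterion carries no information about the scaling factor and the division fails.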

Table 1. Structural characteristics of proteins included in the RaSP50 set

Secondary structureᵇ and CATH numbersᶜ
RaSP codeᵃ | %H | %E | %T | %G | Remainder | Domain (1) | Domain (2) | Domain (3) | Domain (4)
 |  |  |  |  |  | 3.30.456.10 | 1.10.439.10 |  | 
ATX | 1.0 | 55.7 | 12.2 | 3.5 | 27.7 | NC (2) |  |  | 

SCOP classificationᵈ
RaSP codeᵃ | Class (1) | Fold (1) | Class (2) | Fold (2) | Superfold familyᵉ
GST | α | GST, C-term | α/β | Thioredoxin-like | 
CYC | α | Cytochrome c |  |  | 
HBN | α | Globin-like |  |  | globin
MBN | α | Globin-like |  |  | globin
COL | M | Toxins' translocation |  |  | globin
PER | α | Heme PERs |  |  | 
LOX | α | LOX | β | Colipase binding | 
APE | α | 4 helix bundle |  |  | α-up-down
UBQ | α+β | β-Grasp |  |  | UB αβ roll
PAP | α+β | Cys proteinases |  |  | 
TPI | α/β | TIM barrel |  |  | TIM barrel
PAH | α+β | Ntn hydrolases |  |  | 
SBC | α/β | Subtilases |  |  | α/β doubly wound
SBN | α/β | Subtilases |  |  | α/β doubly wound
UOX | α/β | FAD/NAD-binding | α+β | FAD reductases | 
SDF | α+β | SDF C-term | α | Long α-hairpin | 
ADH | α/β | Rossmann-fold | β | GroES-like | α/β doubly wound
BTE | S | Snake toxin-like |  |  | 
CTG | β | Serine proteases |  |  | 
TGN | β | Serine proteases |  |  | 
LCL | β | ConA-like lectins |  |  | jelly roll
CNA | β | ConA-like lectins |  |  | 
SOD | β | IG-like sandwich |  |  | IG-Greek key
IGG | β | IG-like sandwich |  |  | IG-Greek key
XYN | β | ConA-like lectins |  |  | 
PEP | β | Acid proteases |  |  | 
PGN | β | Acid proteases |  |  | 
REN | β | Acid proteases |  |  | 

ᵃ Mnemonic code used to identify proteins in this study. Proteins are sorted by CATH numbers (class: all α, α and β, then all β) to emphasize structural relationships.

ᵇ Protein secondary structure fractional compositions (FCs) as assigned by DSSP. %H, % α-helix; %E, % β-sheet; %T, % turn; %G, % 3₁₀ helix; remainder, residues with all remaining assignments combined (unassigned + B + S + I).

ᶜ CATH assignments for all unique domains in each protein. Some proteins have more than one domain with the same CATH assignment. For brevity, repeated CATH assignments for individual proteins are combined into a single entry in the table. Light gray shading indicates domain folds for which substitute proteins were found and acquired. Darker gray shading indicates proteins in the RaSP50 set with redundant domain folds. Redundant RaSP50 proteins were selected because of unique FCH or FCE, and/or frequent use in other basis sets. They will be utilized in future structure-band shape relationship studies. Parenthesized digits following NC (not classified) are for proteins that have not been assigned CATH numbers, and represent the authors' conjecture of the class to which the protein may eventually be assigned. The CATH database, because it is still new, is undergoing constant revision, and therefore the CATH assignments listed here may change in future releases.

ᵈ First two levels of the SCOP assignments for each protein. Symbols used for SCOP classes include: α, all-α-helical domains; β, all-β-sheet domains; α/β, domains with predominantly β–α–β secondary structural units (parallel β-sheets); α+β, domains with segregated α-helical and β-sheet regions (antiparallel β-sheet); α,β, domains that have distinct regions of different classes; S, small proteins with little regular secondary structure; P, short peptides; M, membrane proteins (typically α-helical). Abbreviated SCOP fold names are listed to provide a common-name indication of the fold. The RaSP code may be given if the SCOP fold name is the same as the protein name.

ᵉ This column indicates the presence of one or more of the nine most common domain folds (superfolds) in a given protein. Superfold family membership was determined by identifying the CATH numbers of the proteins listed by Orengo et al. (1994). All RaSP proteins with an identified superfold CATH number were assumed to be members of that superfold family.

Table 2. Identification and quality data for proteins included in the RaSP50 set

Commercial preparation
RaSP codeᵇ | Protein name | Catalog numberᶜ | Costᵈ | E.C.ᵉ | Est. % purityᶠ | SWISS-PROT IDᵍ | Source
ADH | Alcohol dehydrogenase | A6128 | L | 1.1.1.1 | 85 | ADHE_HORSE | horse liver
ALA | α-Lactalbumin | N/A | G | 2.4.1.22 | 98 | LCA_HUMAN | human milk
APE | Apolipoprotein E3 | N/A | G |  | 100 | APE_HUMAN | human
ATX | α-Hemolysin (alphatoxin) | N/A | G |  | 100 | HLA_STAAU | Staphylococcus aureus
AVI | Avidin | A9257 | M |  | 98 | AVID_CHICK | hen egg white
BTE | Erabutoxin b | E4888 | X |  | 100 | NXSI_LATSE | Laticauda semifasciata
CAH | Carbonic anhydrase | C3934 | I | 4.2.1.1 | 85 | CAH2_BOVIN | bovine erythrocyte
CNA | Concanavalin A | L7647 | I |  | 100 | CONA_CANEN | jack bean
COL | Colicin A, C-terminal domain | N/A | G |  | 100 | not found | bacterial
CSA | Citrate synthetase | C3260 | M | 4.1.3.7 | 95 | CISY_PIG | porcine heart
CTG | α-Chymotrypsinogen A | C4879 | I | 3.4.21.1 | 100 | CTRA_BOVIN | bovine pancreas
CYC | Cytochrome c | C7752 | I |  | 100 | CYC_HORSE | horse heart
DPR | Dihydropteridine reductase | D6888 | H | 1.6.99.7 | 90 (S) | not found | sheep liver
FTN | Ferritin (apo) | A3461 | I |  | 90 | FRIL_HORSE | horse spleen
GST | Glutathione S-transferase | G6511 | M | 2.5.1.18 | 90 | not found | equine liver
HBN | Hemoglobin | H2500 | I |  | 95 | HBA_BOVIN | bovine blood
IGG | Immunoglobulin γ | 56834 | L |  | 99 | not found | human
INS | Insulin | 15500 | I |  | 100 | INS_BOVIN | bovine pancreas
LCL | Lectin, lentil | L9267 | M |  | 100 | LEC_LENCU | lentil
LSZ | Lysozyme | L6876 | I | 3.2.1.17 | 100 | LYC_CHICK | chicken egg white
MBN | Myoglobin | M1882 | I |  | 95 | MYG_HORSE | horse heart
MON | Monellin | M7755 | M |  | 95 | MONA_DIOCU, MONB_DIOCU | Dioscoreophyllum cumminsii
MTH | Metallothionein II | M5392 | H |  | ? | MT2A_RABIT | rabbit liver
OVA | Ovalbumin (egg albumin) | A5503 | I |  | 90 | OVAL_CHICK | hen
PAB | Parvalbumin | P6393 | X |  | 98 | PRVA_RABIT | rabbit muscle
PAH | Penicillin amidohydrolase | P3319 | I | 3.5.1.11 | 85 | PAC_ECOLI | Escherichia coli
PAP | Papain | P3125 | L | 3.4.22.2 | NT | PAPA_CARPA | papaya latex
PEP | Pepsin | P6887 | I | 3.4.23.1 | 100 | PEPA_PIG | porcine stomach
PER | Peroxidase | P4794 | M | 1.11.1.7 | 95 | PER_ARTRA | Arthromyces ramosus
PGK | Phosphoglyceric kinase | P7634 | M | 2.7.2.3 | 95 | PGK_YEAST | baker's yeast
PGN | Pepsinogen | P4781 | L | 3.4.23.1 | 90 | PEPA_PIG | pig stomach
PLA | Phospholipase A2 | P8913 | L | 3.1.1.4 | ? | PA2_BOVIN | bovine pancreas
R61 | DD-transpeptidase | N/A | G | 3.4.16.4 | NT | DAC_STRSQ | Streptomyces r61
REN | Rennin (chymosin b) | R4879 | M | 3.4.23.4 | 90 | CHYM_BOVIN | calf stomach
RIC | Ricin | L9639 | X | 3.2.2.22 | 99 | RICI_RICCO | castor bean
RNA | Ribonuclease A | R5500 | L | 3.1.27.5 | 100 | RNP_BOVIN | bovine pancreas
SBC | Subtilisin Carlsberg | P5380 | I | 3.4.21.62 | 90 | SUBT_BACLI | Bacillus licheniformis
SBN | Subtilisin BPN' (nagarse) | P4789 | I | 3.4.21.62 | 80 | not found | not specified
SDF | Superoxide dismutase (Fe) | S5389 | H | 1.15.1.1 | 90 | SODF_ECOLI | Escherichia coli
SOD | Superoxide dismutase (Cu,Zn) | 86200 | L | 1.15.1.1 | 85 | SODC_BOVIN | bovine erythrocyte
TGN | Trypsinogen | T1143 | I | 3.4.21.4 | 85 | TRYP_BOVIN | bovine pancreas
TIB | Trypsin inhibitor (soybean, Bowman-Birk) | T9777 | L | 3.4.21.4 | 90 | IBB1,2,3_SOYBN | soybean
TIP | Trypsin inhibitor (BPTI) | T0256 | H |  | 100 | BPT1_BOVIN | bovine pancreas
TMT | Thaumatin | T7638 | L |  | 95 (S) | THM1_THADA | Thaumatococcus daniellii
TPI | Triose phosphate isomerase | T2507 | M | 5.3.1.1 | 95 | TPIS_YEAST | baker's yeast
TRO | Troponin | T1771 | X |  | 85 | TPCS_CHICK | chicken muscle
UBQ | Ubiquitin | U6253 | L |  | 100 | UBIQ_HUMAN | bovine erythrocyte
UOX | Glucose oxidase | G7016 | I | 1.1.3.4 | 95 (S) | GOX_ASPNG | Aspergillus niger
XYN | Xylanase | 95595 | H | 3.2.1.8 | 90 | not found | Trichoderma viride

 Crystal structureWhat_Check scoresa
RaSP Codeb% Res Missingh%IdeiSourcePDB IDResjR factorRamachandranBack bone# Bad H-bonds
  • a

    a Scores assigned by WHAT_CHECK indicating the quality of particular characteristics of crystal structures as compared to known high-quality crystal structures. The # Bad H-bonds column lists a count of the number of hydrogen bond donors or acceptors buried in the protein without apparent partners. The Ramachandran and Backbone values are RMS Z scores. Each crystal structure is given several RMS Z scores by WHAT_CHECK which compare its quality to a set of known high-quality structures. Positive RMS Z scores indicate high-quality structures (see text). Scores below −3 indicate potential problems with the crystal structure, and scores below −4 were used as grounds for rejecting crystal structures with serious flaws.

  • b

    b Mnemonic code used to identify RaSP set proteins in this text and subsequent publications. Proteins are sorted alphabetically by RaSP code in this table.

  • c

    c Catalog numbers for proteins purchased from Sigma begin with a letter. Proteins from Fluka are indicated by a 5-digit number. Those marked with N/A were gifts from collaborators.

  • d

    d Cost Codes: 1, inexpensive, less than $1 per milligram; L, low, $1–5 per mg; M, Moderate, 5–20 per mg; H, high, $20–50 per mg; X, extremely high $50–200 per mg; P, prohibitive, >$200 per mg; U, became unavailable commercially during the time the basis set was being constructed; G, gift from collaborators.

  • e

    e E.C.#: Enzyme classification number.

  • f

    f Purity estimated from SDS PAGE. NT, not tested; ?, no bands observed. Other indications of the protein suitability for use are indicated by the following: S, protein purified by size-exclusion chromatography.

  • g

    g SWISS-PROT identification code for commercially available preparation. SWISS-PROT IDs for crystallized proteins are available through the CATH protein listings.

  • h

    h Percentage of total residues in protein not included in the crystal structure. These values were determined by comparison with the SWISS-PROT listing for the crystallized protein. Negative numbers indicate crystal structures with more residues than listed in SWISS-PROT.

  • i

    i Percent identify between crystal structure sequence and the SWISS-PROT sequence listed for the commerically available protein as listed in the HSSP database.

  • j

    j Resolution and R-factor of the crystal structure selected as best for the protein.

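ADH | 0.00 | 100 | horse liver | 2ohx | 1.80 | 0.173 | −1.336 | −0.273 | 45
ATX | 0.00 | 100 | Staphylococcus aureus | 7ahl | 1.90 | 0.199 | −1.256 | −0.683 | 194
AVI | 3.13 | 99 | hen egg white | 1avd | 2.70 | 0.174 | −4.285 | 0.143 | 58
BTE | 0.00 | 98 | Laticauda semifasciata | 3ebx | 1.40 | 0.140 | 1.274 | −0.068 | 15
CAH | 0.39 | 79 | human erythrocytes | 1hcb | 1.60 | 0.177 | −2.724 | −0.282 | 15
CNA | 4.05 | 82 | jack bean | 1scs | 1.60 | 0.178 | 0.309 | 1.044 | 13
COL | 1.50 | ? | Escherichia coli | 1col | 2.40 | N/A | −1.145 | 0.345 | 17
CSA | −0.46 | 94 | chicken heart | 1csh | 1.60 | 0.164 | 0.493 | 0.309 | 30
CTG | 0.00 | 100 | bovine pancreas | 2cga | 1.80 | 0.173 | −2.961 | −0.137 | 40
CYC | 0.00 | 100 | horse heart | 1hrc | 1.90 | 0.179 | −3.300 | −1.085 | 13
DPR | 2.07 | ? | rat liver | 1dhr | 2.30 | N/A | 0.292 | 1.079 | 17
FTN | 0.00 | 99 | horse spleen | 1hrs | 2.60 | 0.180 | −3.248 | −0.328 | 7
GST | 0.00 | ? | rat liver | 2gst | 1.80 | 0.160 | 0.362 | −0.791 | 55
INS | 8.93 | 100 | bovine pancreas | 1bph | 2.00 | 0.160 | −0.804 | −1.278 | 8
LCL | −8.06 | 100 | lentil seed | 1len | 1.80 | 0.175 | −2.572 | 1.237 | 32
LSZ | 0.00 | 100 | hen egg white | 1hel | 1.70 | 0.152 | −0.041 | −1.523 | 7
MBN | 0.00 | 99 | horse heart | 1ymb | 1.90 | 0.155 | −3.124 | 0.503 | 20
MON | 1.05 | 100 | Dioscoreophyllum cumminsii | 1mol | 1.70 | 0.174 | −0.062 | 0.653 | 8
MTH | 1.61 | 85 | rat liver | 4mt2 | 2.00 | 0.197 | 0.572 | −1.288 | 17
PAB | 0.00 | 87 | rat muscle | 1rtp | 2.00 | 0.181 | 0.029 | 1.544 | 11
PAH | 2.22 | 100 | Escherichia coli | 1pnk | 1.90 | 0.190 | −0.311 | −0.025 | 55
PAP | 0.00 | 100 | papaya latex | 1ppn | 1.60 | 0.159 | −1.353 | −1.253 | 23
PER | 2.33 | 100 | Arthromyces ramosus | 1arv | 1.60 | 0.178 | 0.401 | 0.819 | 21
PGK | 0.00 | 98 | baker's yeast | 3pgk | 2.50 | N/A | −5.209 | −9.143 | 75
PLA | 0.00 | 99 | bovine pancreas | 1bp2 | 1.70 | N/A | 1.215 | 0.933 | 7
R61 | 0.57 | | Streptomyces R61 | 3pte | 1.60 | 0.148 | 1.661 | 0.295 | 20
REN | 0.93 | 100 | bovine (calf) | 4cms | 2.20 | 0.158 | −1.823 | −0.267 | 16
RIC | 0.00 | 100 | castor bean | 2aai | 2.50 | 0.212 | −3.756 | −1.867 | 66
RNA | 0.00 | 100 | bovine pancreas | 6rat | 1.50 | 0.152 | −0.853 | 0.598 | 6
SBC | 0.00 | 98 | Bacillus subtilis | 1cse | 1.20 | N/A | −0.653 | 0.092 | 14
SBN | 0.00 | ? | Bacillus amyloliquefaciens | 2st1 | 1.80 | N/A | 0.074 | 0.289 | 13
SDF | 0.00 | 100 | Escherichia coli | 1isc | 1.80 | 0.188 | 0.153 | −0.595 | 19
SOD | 0.00 | 100 | bovine erythrocyte | 1sxc | 1.90 | 0.560 | −0.388 | −0.966 | 8
TGN | 0.00 | 100 | bovine pancreas | 2tga | 1.80 | N/A | −2.076 | 1.171 | 28
TIP | 0.00 | 100 | bovine pancreas | 1bpi | 1.10 | 0.146 | −0.466 | −1.495 | 11
TMT | 0.00 | 99 | Thaumatococcus daniellii | 1thw | 1.75 | 0.181 | 0.026 | 0.049 | 19
TPI | 0.00 | 100 | baker's yeast | 7tim | 1.90 | 0.183 | −3.339 | 0.049 | 50
TRO | 0.00 | 100 | chicken muscle | 1top | 1.78 | 0.168 | 2.877 | 0.914 | 7
UBQ | 0.00 | 100 | synthetic (human?) | 1ubi | 1.80 | 0.165 | 1.199 | 2.131 | 4
UOX | 0.34 | 100 | Aspergillus niger | 1gal | 2.30 | 0.181 | −1.427 | −1.911 | 63
XYN | 0.00 | ? | Trichoderma reesei | 1xyp | 1.50 | 0.189 | 0.000 | −0.810 | 16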
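The missing-residue percentages described in footnote h could be computed along the following lines. This is a minimal sketch that assumes the SWISS-PROT sequence length is the denominator; the function name `percent_missing` and the example lengths are hypothetical and are not taken from the table.

```python
def percent_missing(n_swissprot: int, n_crystal: int) -> float:
    """Percentage of SWISS-PROT residues absent from the crystal structure.

    Assumption: the SWISS-PROT sequence length is the reference (denominator).
    A negative result means the crystal structure contains more residues
    than the SWISS-PROT entry lists.
    """
    if n_swissprot <= 0:
        raise ValueError("SWISS-PROT sequence length must be positive")
    return round(100.0 * (n_swissprot - n_crystal) / n_swissprot, 2)


# Hypothetical lengths, for illustration only.
print(percent_missing(200, 190))  # crystal structure lacks 10 residues
print(percent_missing(100, 104))  # crystal structure has 4 extra residues
```

A sign convention of this kind keeps the "0.00" entries unambiguous: they mean the two residue counts agree.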
Table 3. Secondary structure FC summary for 37 nonmutant hen egg white lysozyme (LSZ) crystal structures
 Secondary structure fractional compositionᵃ
  • a

    a All fractional compositions (FCs) were tabulated from the DSSP assignments. FCs are expressed as the percentage of residues in the crystal structure with a given secondary structure.

  • b

    b The crystal structure with the PDB identification code 1hel was selected for use as the highest quality LSZ reference. The selection criteria are described in the text.

%H (α-helix) | 28.68 | 32.56 | 3.88 | 30.331
%E (β-sheet) | 6.20 | 9.30 | 3.10 | 6.35 | 6.20
%T (Turn) | 19.38 | 27.13 | 7.75 | 23.58 | 23.26
%G (3₁₀-helix) | 4.65 | 10.85 | 6.20 | 10.21 | 10.85
%Remainder (C + B + S) | 26.35 | 39.53 | 13.18 | 29.48 | 28.68
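Fractional compositions of the kind tabulated above can be derived from a per-residue DSSP assignment string. The sketch below is illustrative only: it assumes one DSSP character per residue, treats every code other than H, E, T, and G as "Remainder" (which also folds in any code beyond C, B, and S), and uses hypothetical function names and input strings rather than the actual LSZ assignments.

```python
from collections import Counter

# DSSP codes reported individually; everything else goes to "Remainder".
ASSIGNED = ("H", "E", "T", "G")


def fractional_composition(dssp: str) -> dict:
    """Return the percentage of residues in each secondary structure class."""
    if not dssp:
        raise ValueError("empty DSSP string")
    counts = Counter(dssp)
    n = len(dssp)
    fc = {code: 100.0 * counts[code] / n for code in ASSIGNED}
    assigned_total = sum(counts[code] for code in ASSIGNED)
    fc["Remainder"] = 100.0 * (n - assigned_total) / n
    return fc


def summarize(values: list) -> dict:
    """Minimum, maximum, range, and mean of one FC across several structures."""
    lo, hi = min(values), max(values)
    return {"min": lo, "max": hi, "range": hi - lo,
            "mean": sum(values) / len(values)}


# Hypothetical 10-residue assignment string, for illustration only.
print(fractional_composition("HHHHEETTG-"))
print(summarize([10.0, 20.0, 30.0]))
```

Applying `summarize` to the %H values of many crystal structures of the same protein gives a simple measure of how much the reference FC depends on which structure is chosen.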
Figure 1.

CATH space representation of all domain folds in the RaSP set. A comparison of the C (Class), A (Architecture), and T (Topology) numbers of protein domains from CATH version 1.1. Squares represent protein domains that are members of superfold families (Table 1); non-superfold domains are represented with circles. (▪, •) Proteins used in the RaSP50 set; (□, ○) proteins acquired but discarded for reasons described in the text. Solid bars represent all existing CATH numbers, that is, all known protein folds. Architectures are shown in increasing order for each class. For example, α/β roll proteins (Class = 3) have the Architecture number A = 10; α/β barrel proteins A = 20, etc. For Class = 2 proteins, those with the β-ribbon Architecture have A = 10 and so on. Abbreviated Architectures are as follows: 4-sandwich, four-layer sandwich; dist. sandwich, distorted sandwich; orth. prism, orthogonal prism; alig. prism, aligned prism. For detailed definitions of these names, see the CATH Lexicon on the CATH web site. Topology numbers have been assigned in increments of 10, as have Architecture numbers (see Table 1). To illustrate the actual number of existing Architectures in the figure, the assigned topology values were divided by 10.

Figure 2.

Helix-Sheet (HE) space representation of the RaSP set protein FCs. A comparison of the relative amounts of α-helix and β-sheet contents (FCs) in different groups of proteins. RaSP protein symbols are described in the Figure 1 legend. Small points represent proteins from the PDB_SELECT database, and are included to provide a reference for the range of existing FC values.

Figure 3.

Ribbon representations of key structural elements for selected RaSP50 proteins generated with MOLSCRIPT (Kraulis 1991). These images illustrate some of the different types of protein folds, tertiary interactions, and distortions of secondary structure present in the RaSP50 set. Nonperiodic structural segments in most proteins have been omitted for clarity. Spectra of some of these proteins appear in Figure 5.

Figure 4.

Structure distributions in the RaSP50 basis set and the PDB_SELECT database (35% sequence homology cutoff). Solid circles in these histograms show the number of instances for each structure category in the RaSP50 basis set (left y axis of each plot). The equivalent distribution from the PDB_SELECT database is indicated by the solid line in each histogram (right y axes). Plots represent the following protein characteristics as assigned by the DSSP program. Left panel: H, % α-helix; E, % β-sheet; T, % Turn; C, % unassigned structures (Note: This does not include the B, S, or any other assignments). The G (3₁₀ helix), B (isolated extended), and S (bend) structures all have narrow distributions, with few or no instances in the 20%–30% and higher categories. Right panel: H-L, lengths (number of residues) of individual α-helices; E-L, lengths of individual β strands; E-S/S, size of individual β-sheets (β-strands per β-sheet).
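The segment lengths plotted in the right panel (H-L and E-L) can be approximated by counting runs of identical codes in a per-residue DSSP string. The following is a rough sketch rather than the procedure used for the figure: the function name and input string are hypothetical, chain breaks are ignored, and β-sheet size (strands per sheet) is not covered because it requires the sheet labels in the DSSP output.

```python
from itertools import groupby


def segment_lengths(dssp: str, code: str) -> list:
    """Lengths of consecutive runs of `code` in a per-residue DSSP string."""
    return [len(list(run)) for key, run in groupby(dssp) if key == code]


# Hypothetical assignment string, for illustration only.
example = "HHHH-EEE-HH-EE"
print(segment_lengths(example, "H"))  # lengths of individual helices
print(segment_lengths(example, "E"))  # lengths of individual strands
```

Pooling these lists over all proteins in a set and binning them gives histograms comparable in form to those described in the legend.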

Figure 5.

Hybrid IR+CD spectra of some RaSP50 set protein pairs with similar secondary structure FCs but different folds. These demonstrate the spectral consequences of various differences in tertiary structure (Fig. 3). Spectral pairs have been offset for clarity. The infrared spectra have been normalized to the same maximum intensities. The relative intensities of the CD spectra are proportional to their mean residue ellipticities. The RaSP coordinate system that was used in analyses is accompanied by the native units for each data type (CD and IR) below the plot. The RaSP codes for the proteins whose spectra are shown are as follows: trace pair A, Ferritin (FTN, ——) and hemoglobin (HBN, - - - -); B, citrate synthetase (CSA, ——) and troponin (TRO, - - - -); C, lipoxygenase (LOX, ——) and phosphoglycerate kinase (PGK, - - - -); D, ubiquitin (UBQ, ——) and ribonuclease A (RNA, - - - -); E, erabutoxin (BTE, ——) and concanavalin A (CNA, - - - -); F, avidin (AVI, ——) and γ immunoglobulin (IGG, - - - -).


This work was funded by a grant from the Belgian National Fund for Scientific Research, TELEVIE (contract 7.4549.95F, Belgium), and ARC (Action de Recherche Concertées). E.G. is Research Director of the Belgian National Fund for Scientific Research. We also thank our collaborators for generously providing us with proteins that are not commercially available: Robert O. Ryan, University of Alberta, Edmonton, Canada (apolipoprotein E3, APE); Martine Nguyen-Distèche, University of Liège, Belgium (DD-transpeptidase, R61); Daniel Baty, CNRS, France (Colicin A, COL); Sucharit Bhakdi, University of Mainz, Germany (α-hemolysin, ATX); and Valentina Bychkova, Russian Academy of Sciences, Moscow (human α-lactalbumin, ALA).

The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 USC section 1734 solely to indicate this fact.