Participation of protein sequence termini in crystal contacts



The analysis of crystal packing interactions in a nonredundant set of high-resolution crystal structures of monomeric globular proteins shows that the residues located at the N- and C-termini of the sequence participate in packing interactions more often than expected, and that they often interact with each other. Since the sequence termini are, in general, conformationally very flexible and carry electrical charges of opposite sign, it can be hypothesized that they play a crucial role in the early formation of the nonphysiological contacts that lead to protein crystallization. It is thus not surprising that modest lengthening or shortening of the sequence termini often has a dramatic effect on protein crystallogenesis.


It is well known that it is extremely difficult to crystallize proteins,1 and it has been hypothesized that evolution worked against protein crystallization.2 However, protein crystallization is of crucial importance, as it is a mandatory prerequisite of crystal structure determination. Protein crystallographers commonly adopt multiconstruct strategies, in which many different constructs of the same protein are produced by lengthening or shortening the amino acid sequence.3, 4 Many crystallographers remember some anecdotal case in which well-diffracting crystals were obtained by removing or adding just one or two residues at the sequence termini. These “miracles” can have various interpretations.

First, it is possible that minor modifications of the amino acid sequence improve the folding free energy, with the consequence that proteins are better folded and conformationally more homogeneous. However, the folding energy perturbations are likely to be quite modest, since both the N- and the C-terminus tend to be solvent exposed and it is well known that mutations, insertions, and deletions at the protein surface do not considerably affect the protein folding thermodynamics, especially if they do not occur in helices and strands.5, 6 It is also possible that subtle changes at the sequence termini influence the overall solubility, with consequences for the crystallization process.7 Finally, under the hypothesis of a cotranslational folding mechanism at the ribosome level, it is possible that the length of the N-terminal segment, which would be sequestered, directly or indirectly, by the ribosome, is important for the expression of well-folded proteins.8

Successful crystallogenesis requires large amounts of chemically and conformationally pure protein. Flexible N- and C-termini, even if relatively short, increase conformational heterogeneity of the sample and can therefore hamper crystallization.

In the present communication, a different explanation of the critical relationship between protein crystallization and amino acid sequence termini is presented. It is based on the observation that N- and C-termini are involved in crystal packing contacts more often than expected. They often seem to be crucial in stabilizing the protein solid state phase, and thus it is not surprising that small lengthening or shortening of the sequence termini is so important in determining protein crystallization.

Results and Discussion

The propensity P(TP) of the N- and C-terminal residues to be solvent accessible and to form crystal packing interactions is defined as

$$P(TP) = \frac{n_{TSP} / n_{TS}}{n_{SP} / n_{S}}$$

where nS is the number of solvent accessible residues, nTS is the number of solvent accessible residues that are at the sequence termini, nSP is the number of residues that are solvent accessible and form crystal packing contacts, and nTSP is the number of sequence terminal residues that are solvent accessible and that form crystal packing contacts. Analogously, it is possible to define the propensity P(CP) of C-terminal residues to be solvent accessible and to form crystal packing contacts and the propensity P(NP) of N-terminal residues to be solvent exposed and to form crystal packing contacts.

By definition, propensity values higher than 1.0 indicate that terminal residues tend to be found at the protein surface and to form crystal packing contacts more frequently than expected.

The propensities of terminal residues to be involved in crystal packing interactions are shown in Table I. They were computed by summing the nTSP, nSP, nTS, and nS values over all the protein crystal structures and by considering the first Nt and the last Ct residues to be at the N- and at the C-terminus, respectively (Nt = 1, 2, 3, 4, and 5; Ct = 1, 2, 3, 4, and 5). The values of propensity are considerably larger than 1.0. This clearly shows that the residues close to the sequence termini participate in crystal packing interactions more often than expected. This conclusion is reinforced by the observation that the propensities are larger than 1.0 in the majority of the individual crystal structures (Table I).
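The pooling described above can be sketched in a few lines of Python. This is only an illustration: the per-structure counts are invented placeholders rather than values from the dataset, and the names `Counts`, `propensity`, `pooled`, and `percent_above_one` are assumptions introduced here. The ratio is written as (nTSP/nTS)/(nSP/nS), which is the form that reproduces the propensities listed in Table II.

```python
import math
from dataclasses import dataclass


@dataclass
class Counts:
    """Residue counts for one crystal structure (hypothetical container)."""
    n_tsp: int  # terminal, solvent accessible, forming packing contacts
    n_sp: int   # solvent accessible, forming packing contacts
    n_ts: int   # terminal, solvent accessible
    n_s: int    # solvent accessible


def propensity(c: Counts) -> float:
    """Return (nTSP / nTS) / (nSP / nS), or NaN if a denominator is zero."""
    if c.n_ts == 0 or c.n_sp == 0 or c.n_s == 0:
        return float("nan")
    return (c.n_tsp / c.n_ts) / (c.n_sp / c.n_s)


def pooled(structures: list) -> Counts:
    """Sum each count over all structures before taking the ratio."""
    return Counts(
        sum(c.n_tsp for c in structures),
        sum(c.n_sp for c in structures),
        sum(c.n_ts for c in structures),
        sum(c.n_s for c in structures),
    )


def percent_above_one(structures: list) -> float:
    """Percentage of structures whose individual propensity exceeds 1."""
    values = [propensity(c) for c in structures]
    defined = [v for v in values if not math.isnan(v)]
    if not defined:
        return float("nan")
    return 100.0 * sum(v > 1.0 for v in defined) / len(defined)


# Invented counts for three structures with Nt = Ct = 1 (at most two termini).
example = [
    Counts(n_tsp=2, n_sp=40, n_ts=2, n_s=100),
    Counts(n_tsp=1, n_sp=30, n_ts=2, n_s=90),
    Counts(n_tsp=0, n_sp=50, n_ts=2, n_s=110),
]

overall = propensity(pooled(example))
share = percent_above_one(example)
print(f"overall propensity: {overall:.3f}")
print(f"structures with propensity > 1: {share:.1f}%")
```

Pooling the counts before taking the ratio avoids averaging per-structure propensities, which are noisy when a structure has only one or two terminal residues.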

Table I. Overall Propensity Values, Computed by Summing all the Variables that Define P(TP) Over the 399 Structures Examined in This Article, and Percentage of Structures where the Individual Propensity P(TP) is Larger than 1. Five Values of Nt and Ct (Collectively Named Tt in the Table) were Considered
Tt    P(TP)    % Cases with P(TP) > 1

It is also interesting to observe that crystal packing contacts between N- and C-terminal residues are rather frequent. For example, for Nt = Ct = 1, only 2.9% (±0.1) of the solvent accessible residues belong to the sequence termini. However, 6.3% (±1.0) of the crystal packing contacts link N- and C-terminal residues. Similar results are observed for other values of Nt and Ct. In contrast, contacts between sequence termini and other regions of the proteins do not show any tendency to involve particular chemical groups rather than others.

Interestingly, the crystal symmetry is not correlated with the presence of packing interactions that involve the protein termini. The distributions of the space groups are very similar in crystals with contacts involving N- and/or C-termini and in crystals lacking these interactions. They are also closely similar to the space group histograms distributed by the Protein Data Bank (PDB). Analogously, the presence of screw axes is not associated with packing contacts that involve the protein termini.

It must also be noted that no difference was observed between the residues at the N-terminus and the residues at the C-terminus. Both termini have the same propensity to be involved in crystal packing contacts in the structures examined here.

Propensities were also calculated for different levels of relative solvent accessibility (Table II) and no clear trends appeared. For example, for Nt = Ct = 1, the highest propensity of terminal residues to be involved in crystal packing interactions is observed for amino acids that have a relative solvent accessibility in the 20–40% range. On the contrary, for Nt = Ct = 5, the propensity tends to decrease if the accessibility to the solvent increases.

Table II. Numerical Values Necessary to Compute the Propensities P(TP) for Different Ranges of Relative Solvent Accessibility (SAA) and for Two Values of Nt and Ct (Collectively Named Tt in the Table)
Tt = 1
SAA      nTSP   nSP    nTS   nS      P(TP)
< 20%    2      2592   33    30836   0.721
> 80%    308    3069   447   5062    1.136
Tt = 5
SAA      nTSP   nSP    nTS   nS      P(TP)
< 20%    124    2592   826   30836   1.786
> 80%    591    3069   827   5062    1.179
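As a quick arithmetic check, the legible rows of Table II can be recomputed from their counts. The sketch below assumes the column order nTSP, nSP, nTS, nS, which is the order that reproduces the tabulated propensities; the function and variable names are illustrative only.

```python
def propensity(n_tsp, n_sp, n_ts, n_s):
    """(nTSP / nTS) / (nSP / nS), the ratio assumed for P(TP)."""
    return (n_tsp / n_ts) / (n_sp / n_s)


# (Tt, accessibility range): (nTSP, nSP, nTS, nS, P(TP) as listed in Table II)
rows = {
    (1, "< 20%"): (2, 2592, 33, 30836, 0.721),
    (1, "> 80%"): (308, 3069, 447, 5062, 1.136),
    (5, "< 20%"): (124, 2592, 826, 30836, 1.786),
    (5, "> 80%"): (591, 3069, 827, 5062, 1.179),
}

recomputed = {}
for key, (n_tsp, n_sp, n_ts, n_s, listed) in rows.items():
    value = propensity(n_tsp, n_sp, n_ts, n_s)
    recomputed[key] = value
    tt, saa = key
    print(f"Tt = {tt}, SAA {saa}: recomputed {value:.3f}, listed {listed:.3f}")
```

Each recomputed value agrees with the listed one within rounding to three decimals.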

The high frequency of packing interactions that involve the N- and C-termini, and of contacts between them, suggests that the termini frequently collide in solution, especially at the high concentrations that precede nucleation and crystallization. This might depend on the opposite charges of the N-terminal ammonium and of the C-terminal carboxylate, which attract each other. However, attractive Coulombic interactions cannot be limited to the sequence termini: they may also involve the numerous cationic or anionic side chains that are usually scattered over the globular protein surface. Therefore, it seems more likely that the importance of the electric charges at the sequence termini is due to their spatial delocalization. The C- and N-termini tend to protrude from the globular core and to fluctuate in the solvent surrounding the protein. Because of this conformational freedom, the electric charges at the sequence termini can explore several positions, with a consequently high probability of interacting with other groups. This feature is typical of the sequence termini.

Modest lengthening or shortening of the N- and C-terminal moieties can thus have severe effects on the formation of the nascent intermolecular interactions that lead to crystallization. Obviously, the relationship between the length of the termini and the crystallization tendency depends not only on the termini but also on the overall shape, size, and plasticity of the globular protein. It is therefore impossible to design a detailed strategy for exploiting this relationship in practical cases, with the consequence that systematic experimental approaches are used to increase the success rate of crystallization.9, 10 It must however be kept in mind that an important factor in determining the production of high-quality protein crystals is likely to be the length of the sequence termini, which seem to work like the fingers that dock with the aircraft at airports and allow the boarding of the passengers.

Material and Methods

Attention was focused on monomeric proteins (experimental annotations from the UniProt database11). This was necessary in order to make the dataset as homogeneous as possible: in a dimer, for example, the N-terminus of a protomer interacts with three other sequence termini, which are the C-terminus of the same protein chain and the two termini of the other molecule. Such a situation is completely different from the case of monomeric proteins, where there are only two sequence termini of opposite charge. Membrane proteins and proteins containing coiled-coils were also disregarded. Only X-ray crystal structures were retained from the PDB,12, 13 and those with a crystallographic resolution worse than 2.0 Å were disregarded. Sequence redundancy was reduced at the 90% identity level to reject multiple copies of the same structure, mutants, isoforms, and complexes with different small molecules. A more severe redundancy cutoff was not necessary, since crystallogenesis and sequence homology are likely to be unrelated. Structures with electron density gaps were also ignored. They were detected by looking at the following lines of the PDB files: the REMARK 465 lines, which list the residues that completely lack positional coordinates; the REMARK 470 lines, which list the nonhydrogen atoms of the amino acids that do not have positional coordinates; and the REMARK 475 lines, which enumerate the residues modeled with zero occupancy. Since most of these missing atoms are at the protein surface, structures that contain them must be eliminated when the crystal packing contacts are examined. Structures with conformationally disordered residues were also eliminated, since different solvent accessibilities can be measured for different conformations, not only for the disordered residue but also for the amino acids that surround it. Finally, structures containing nonstandard amino acids were disregarded, since the solvent accessible surface of these residues cannot be estimated by the program NACCESS. In the end, 399 protein crystal structures were retained.
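The REMARK-based screening for electron density gaps can be illustrated with a short Python sketch. It is not the procedure actually used for this dataset: it simply assumes that the presence of any REMARK 465, 470, or 475 record is enough to flag a file, and the function name `has_density_gaps` and the two toy file fragments are hypothetical.

```python
# Remark numbers that signal missing residues, missing atoms,
# or residues modeled with zero occupancy.
GAP_REMARKS = {"465", "470", "475"}


def has_density_gaps(pdb_text: str) -> bool:
    """Return True if the PDB text contains a REMARK 465, 470, or 475 record."""
    for line in pdb_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "REMARK" and fields[1] in GAP_REMARKS:
            return True
    return False


# Toy fragments written for this example; they are not real PDB entries.
complete_entry = "\n".join([
    "HEADER    EXAMPLE",
    "REMARK   2 RESOLUTION.    1.50 ANGSTROMS.",
    "ATOM      1  N   MET A   1      11.000  12.000  13.000  1.00 20.00           N",
])
gapped_entry = "\n".join([
    "HEADER    EXAMPLE",
    "REMARK 465 MISSING RESIDUES",
    "REMARK 465     MET A     1",
    "ATOM      1  N   SER A   2      11.000  12.000  13.000  1.00 20.00           N",
])

print(has_density_gaps(complete_entry))
print(has_density_gaps(gapped_entry))
```

Splitting each line on whitespace keeps the check tolerant of small differences in column alignment, at the cost of not validating the fixed-column PDB layout.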

Crystal packing contacts were determined with the program cpc.c, which is based on a cutoff interatomic distance of 4.5 Å and was successfully used in previous studies.14, 15 Solvent accessibilities were determined with the program NACCESS (with default parameter values), which provides the relative solvent accessible surface area of each residue (the ratio between the solvent accessible area of residue X in the protein and the solvent accessible area of residue X in the tripeptide AXA in an extended conformation). A residue was considered to be accessible to the solvent if its relative accessibility was larger than zero. Residues were considered to be at the N-terminus if they were within the first Nt sequence positions (Nt = 1, 2, 3, 4, and 5). Analogously, the C-terminal residues were identified within the last Ct positions (Ct = 1, 2, 3, 4, and 5). A variety of Nt and Ct values was used to reinforce the results.
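The two criteria above, a 4.5 Å interatomic cutoff for packing contacts and the first Nt and last Ct positions as termini, can be sketched as follows. This is a simplified illustration and not the cpc.c program: it assumes that the atoms of the symmetry-related molecules have already been generated, that residues are numbered consecutively from 1, and that a distance equal to the cutoff counts as a contact. All names are hypothetical.

```python
import math

CUTOFF = 4.5  # Å, interatomic distance cutoff quoted in the text


def residues_in_contact(reference_atoms, neighbor_atoms, cutoff=CUTOFF):
    """Return the residue numbers of the reference molecule that have at
    least one atom within `cutoff` of any atom of a neighboring molecule.

    reference_atoms: list of (residue_number, (x, y, z))
    neighbor_atoms:  list of (x, y, z) from symmetry-related molecules
    """
    hits = set()
    for residue, xyz in reference_atoms:
        if residue in hits:
            continue  # this residue is already known to form a contact
        if any(math.dist(xyz, other) <= cutoff for other in neighbor_atoms):
            hits.add(residue)
    return hits


def terminal_positions(n_residues, nt, ct):
    """Return (N-terminal, C-terminal) residue numbers for a chain numbered
    1..n_residues, using the first nt and the last ct positions."""
    n_term = set(range(1, min(nt, n_residues) + 1))
    c_term = set(range(max(n_residues - ct, 0) + 1, n_residues + 1))
    return n_term, c_term


# Invented coordinates for a 10-residue chain and two neighboring atoms.
reference = [
    (1, (0.0, 0.0, 0.0)),
    (1, (1.5, 0.0, 0.0)),
    (2, (3.8, 0.0, 0.0)),
    (10, (20.0, 0.0, 0.0)),
]
neighbors = [(0.0, 4.0, 0.0), (24.0, 0.0, 0.0)]

contacts = residues_in_contact(reference, neighbors)
n_term, c_term = terminal_positions(10, nt=1, ct=1)
print(sorted(contacts & (n_term | c_term)))
```

The all-against-all distance loop is quadratic in the number of atoms; for full crystal structures a spatial grid or neighbor list would be the usual replacement.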


Helpful discussions with Kristina Djinovic-Carugo (Department of Structural and Computational Biology, MFPL-Vienna) are gratefully acknowledged.