Correspondence to: de Brevern Alexandre G., INSERM UMR-S 665, Dynamique des Structures et Interactions des Macromolécules Biologiques (DSIMB), Université Denis Diderot, Sorbonne Paris Cité 7, INTS, 6, rue Alexandre Cabanel, 75739 Paris cedex 15, France. E-mail: email@example.com (or) Craveur Pierrick, INSERM UMR-S 665, Dynamique des Structures et Interactions des Macromolécules Biologiques (DSIMB), Université Denis Diderot, Sorbonne Paris Cité 7, INTS, 6, rue Alexandre Cabanel, 75739 Paris cedex 15, France. E-mail: firstname.lastname@example.org
β-Sheets are quite frequent in protein structures and are stabilized by regular main-chain hydrogen bond patterns. Irregularities in β-sheets, named β-bulges, are distorted regions between two consecutive hydrogen bonds. They disrupt the classical alternation of side chain direction and can alter the directionality of β-strands. They are implicated in protein-protein interactions and are introduced to avoid β-strand aggregation. Five different types of β-bulges are defined. Previous studies on β-bulges were performed on a limited number of protein structures or one specific family. These studies evoked a potential conservation during evolution. In this work, we analyze the β-bulge distribution and conservation in terms of local backbone conformations and amino acid composition. Our dataset consists of 66 times more β-bulges than the last systematic study (Chan et al. Protein Science 1993, 2:1574–1590). Novel amino acid preferences are underlined and local structure conformations are highlighted by the use of a structural alphabet. We observed that β-bulges are preferably localized at the N- and C-termini of β-strands, but contrary to the earlier studies, no significant conservation of β-bulges was observed among structural homologues. Displacement of β-bulges along the sequence was also investigated by Molecular Dynamics simulations.
Protein 3D structures are often described as a succession of repetitive secondary structures[1, 2]: (i) the α-helix (or 3.6₁₃ helix), characterized by intramolecular hydrogen bonds between amino acid residues i and i + 4, and (ii) the β-sheet, composed of extended chains with hydrogen bonds between adjacent chains. A major difference between these two main regular secondary structures is the nonlocal nature of the hydrogen bonds in β-sheets: hydrogen bonding partners can be far from each other in the sequence. Helical structures represent about one third of residues, while extended structures account for about one fifth.[4, 5] Depending on the strand orientation, a β-sheet can be parallel, anti-parallel or mixed, resulting in different hydrogen-bonding patterns. The planarity of the β-sheet arrangement results in a periodicity in the side-chain orientation, with side chains pointing alternately to either side of the sheet. The sequence specificity of β-strands and their capping regions has been widely analyzed.[7, 8] Prediction of β-sheet structure is difficult[9-11] as β-sheet assembly is more complex than simple pair complementarities.[6, 12, 13]
Like helices, which are often nonlinear structures, β-sheets also show irregularities, named β-bulges. A bulge is formed when extra residues are inserted between successive hydrogen bonds stabilizing the β-sheet, so that usually two or more residues on one strand lie opposite a single residue on the other (which is named X).[16, 17] They are mainly observed between anti-parallel β-strands, and more than two bulges per protein are found on average. Richardson and coworkers were the first to analyze these conformations, identifying 91 β-bulges in 28 proteins, and proposed the first classification into 3 types: Classic, G1, and Wide. Milner-White analyzed the relation between the G1 β-bulge and β-hairpins, describing a particular G1 β-bulge loop related to turns.[17-19] Thornton and coworkers made the first systematic study of β-bulges on a nonredundant dataset. They analyzed 362 β-bulges and extended the earlier classification based on conformation and hydrogen-bond patterns. Consequently, they introduced two new classes: the Bent and the Special β-bulges.
β-Bulges are thus grouped into 5 classes according to backbone conformation and hydrogen-bonding patterns (see Table 1 for more details): (i) the Classic β-bulge occurs between a narrow pair of hydrogen bonds; (ii) the G1 β-bulge occurs only between anti-parallel β-strands; (iii) the Wide β-bulge occurs between widely spaced pairs of hydrogen bonds; (iv) the Bent β-bulge, which does not have a residue X, has residues 1 and 2 inserted in each β-strand; and (v) the Special β-bulge is formed when more than two residues are inserted in the bulged strand. The term “β-bulge residues” in this article refers to all residues that compose the β-bulges (at positions X, 1, 2, 3, and 4) unless otherwise indicated.
Table 1. β-Bulge Type Definition [Color Table can be viewed in the online issue, which is available at wileyonlinelibrary.com.]
Hydrogen bonding patterns
3D conformation examples
In the hydrogen bonding patterns diagram, directional arrows represent oriented β-strands, squares represent residues in β-strands, oval represent residues out of β-strands, and same color residues have their side chains pointing in the same direction. As all β-bulge types (except Bent) could be divided in subcategories, only the most representative hydrogen bonding pattern of the classes have been presented [See Chan, Hutchinson, Harris, and Thornton (Prot Sci, 1993), for frequencies and details on subtypes].
Classic: The conformation of residue 1 is nearly αR-helical. Residues 2 and X are close to the extended β conformation. All three residues have their side chains pointing in the same direction. In parallel β-strands, position 2 adopts the αR-helical conformation and positions 1 and X are in β conformation. In this case, the side chains of residues 1 and 2 point in opposite directions.
G1: Residue 1 is often a Glycine with dihedral angles in the αL region. The hydrogen-bonding pattern is similar to that of the Classic β-bulges, except that residue 1 is always at the beginning of the β-strand (or at the end of a loop). The G1 β-bulge occurs only between anti-parallel β-strands.
Wide: The Wide β-bulge occurs between widely spaced pairs of hydrogen bonds. As residues 1 and 2 are not involved in any main-chain bonding, they can adopt many different conformations. In the anti-parallel case, the residue conformations are either αL − αR or αR − αL. In the parallel case, the conformation of both residues is αR.
Special: Special β-bulges are very similar to the Classic type, but with more than two inserted residues in the bulged strand.
The extra residues do not only disrupt the normal alternation of side chain direction, but also have an effect on the directionality of β-strands; they tend to accentuate the typical right-handed twist of β-sheets. Hence, β-bulges are expected to be well conserved in proteins and a good example of this was reported for SH3 domain protein, where a β-bulge was found to be highly conserved.
β-Sheets are the least exposed and most rigid secondary structure regions. As β-bulges are more exposed than other strand residues, they seem to play a key role in protein-protein interaction, protein folding[23, 24] and other functions.[22, 25] It was also suggested that they can be associated with pathologies such as neurodegenerative disorders resulting from protein aggregation and kidney deficiencies. Nonetheless, most of these studies used a limited set of protein structures or a particular family.
The objective of this work is to study the distribution and the conservation of β-bulges in proteins in terms of types, local backbone conformation and amino acid composition. For this purpose, we superimposed structurally similar proteins using the iPBA tool. This methodology is based on Protein Blocks (PBs), a structural alphabet composed of 16 prototypes, each 5 residues long, that can approximate a complete protein backbone conformation. This library of local conformations is currently the most widely used structural alphabet. iPBA translates protein structures into series of PBs, which can be compared using sequence alignment techniques. iPBA outperformed other established methods on a nontrivial benchmark dataset.[30, 31] Based on the superimposed structures, the distribution and conservation of β-bulges in terms of their types, local backbone conformation and amino acid composition were studied.
Analysis of the secondary structures
In total, 12,132 structures, representing 2,180,241 amino acids, were used for this study, out of 16,712 structures in the SCOP dataset filtered at 95% sequence identity. The excluded protein chains comprise structures solved by Nuclear Magnetic Resonance, structures with nonstandard PDB file formats, and structures for which PROMOTIF failed to assign backbone conformations. Table 2 summarizes the secondary structure assignment for the SCOP95 dataset. Three structural classes, that is, α/β, α + β, and all-β, each represent about a quarter of our dataset, while all-α represents only 16.8% of the protein chains. Secondary structure assignment resulted in 35.1% of residues in α-helical conformation, 18.5% in β-sheets and the remaining 46.4% in coils. Similar results were found for the SCOP40 dataset, and the secondary structure distributions are in agreement with previous studies.[4, 32, 33]
Table 2. Distribution of Secondary Structures and β-Bulges in Different SCOP Classes
nb res (%)
Prot β-bulge (%)
Per str (%)
For each SCOP class, the table gives the percentage of proteins (prot), the percentage of residues (nb res), the percentage of helical, extended and coil residues (α, β, and coil), the number of β-strands (strand nb.), the percentage of proteins with at least one β-bulge (prot β-bulge), the percentage of amino acids assigned as β-bulge residues, and the average number of β-bulges per strand (per str.). The SCOP class mult. d. denotes multidomain proteins and memb+ corresponds to membrane proteins.
all − α
α + β
all − β
β-Bulges represent 3.35% of the residues in the dataset (24,142 β-bulges representing 73,096 residues, see Table 2). They are as recurrent as the polyproline II helix conformation or the frequent types of β-turns. Chan and coworkers analyzed only 362 β-bulges from 170 proteins. The dataset used in this work is 66 times larger; nevertheless, a similar distribution of β-bulges was found (see Supporting Information Table S1). Antiparallel β-bulges represent 92.3% of all β-bulges, parallel ones being less frequent (7.7%). On average, each protein has two β-bulges, while the average occurrence reaches three for the all-β proteins.
As expected, the average number of β-bulges per protein is proportional to the length and the number of β-strands. More than 90.1% of β-bulges were found in structures that are shorter than 400 residues and have at least 3 β-strands.
Classic β-bulges are the majority with a frequency of 57.0%, followed by G1 β-bulges (32.8%), the other three types being rarer (Wide β-bulges: 4.9%, Special β-bulges: 3.3% and Bent β-bulges: 1.9%). A significant change compared with the previous study is an increase in the frequency of antiparallel Classic β-bulges from 46.7 to 52.6% and a decrease in antiparallel Wide β-bulges from 8.3 to 3.7%.
PROMOTIF assigns β-bulges according to their conformation and hydrogen-bonding patterns. Hence, these characteristics could also apply to residues localized in a loop, as noticed for G1 β-bulge loop. β-bulges are observed with residues completely localized inside β-sheets, partly outside or completely outside the sheet.
As shown in Table 3 only 54.3% of β-bulges are composed entirely of residues localized inside the β-strands. The rest is mainly represented by antiparallel G1 β-bulges with 98.1% having at least one residue localized outside β-strands, which often corresponds to G1 β-bulge loop.
Table 3. β-Bulge Types
In and Out (%)
The observed distribution in the dataset of Bent (B), Special (S), Wide (W), G1 (G), and Classic (C) β-bulges, in parallel (P) or antiparallel (A) β-sheets, is given. The corresponding frequencies of β-bulges composed of residues localized completely in β-strands (In), completely outside β-strands (Out) or partly in β-strands (In and Out) are highlighted.
β-Bulges are usually localized close to β-strand extremities, that is, 90% of β-bulges are within 3 residues of a β-strand extremity (see Supporting Information Table S2). If we consider the β-bulges that are composed of residues lying exclusively in β-strands, the bulge residues are rarely localized in the middle of β-strands; this behavior can be observed for all types of β-bulges (see Supporting Information Fig. S1). β-Bulge residues (75.35%) are localized either within a strand or between two consecutive strands (see Supporting Information Table S3).
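As an illustration of how such a proximity statistic can be computed, the following minimal sketch measures the distance of a bulge to the nearest strand end. The distance definition (smallest sequence separation between any bulge residue and either strand end), the function names and the toy coordinates are assumptions made for this example; the exact definition used for Supporting Information Table S2 is not given in the text.

```python
def distance_to_extremity(bulge_positions, strand_start, strand_end):
    """Smallest sequence separation between any bulge residue and a strand end."""
    return min(
        min(abs(p - strand_start), abs(p - strand_end))
        for p in bulge_positions
    )


def fraction_near_extremity(bulges, max_distance=3):
    """Fraction of bulges lying within max_distance residues of a strand end.

    bulges: list of (bulge_positions, strand_start, strand_end) tuples.
    """
    near = sum(
        1
        for positions, start, end in bulges
        if distance_to_extremity(positions, start, end) <= max_distance
    )
    return near / len(bulges)


# Toy data (hypothetical residue numbers, not taken from the dataset)
toy_bulges = [
    ((12, 13), 10, 20),  # 2 residues from the strand start
    ((15,), 10, 20),     # 5 residues from either end
    ((29, 30), 24, 30),  # touches the strand end
    ((41,), 40, 50),     # 1 residue from the strand start
]
print(fraction_near_extremity(toy_bulges))
```

With these toy values, three of the four bulges fall within 3 residues of a strand end.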
β-Bulge in SCOP classes
As seen in Table 2 and Supporting Information Table S4, the distribution of β-bulges is not similar in all SCOP classes. β-Bulges were even found in the all-α class, which by definition has a low β-sheet content. About 30.7% of these β-bulges are entirely found inside β-sheets and are mainly antiparallel G1 β-bulges (54.9%). As the α/β class is mainly composed of parallel β-sheets, it is expected to have the highest content of parallel Special, Wide, Bent, and Classic β-bulges (3.4, 5.0, 2.9, and 20.2%, respectively). The α + β and all-β classes exhibit roughly the same behavior, with the dominance of antiparallel Classic β-bulges (60.5 and 55.2%, respectively), a significant representation of antiparallel G1 β-bulges (31.0 and 35.0%, respectively) and a limited number of β-bulges outside β-strands (∼15%), like the α/β class.
The multidomain protein and small protein classes have similar distributions, with ∼30% of β-bulges in β-strands, 30% outside β-strands and 38% partly in β-strands. The membrane-associated class has a lower number of β-bulges, but has the highest proportion of antiparallel Wide β-bulges (9.3%, which is twice the frequency in the other classes); the other 5 types of β-bulges were never observed.
Amino-acid preferences of β-bulge
Table 4 shows the amino acid and the Protein Block preferences of the different β-bulges (see Supporting Information Table S5). Aspartate is found over-represented, mainly at positions 1 and 2. For all β-bulge types, a significant preference for the amino acids Glycine, Asparagine, and Proline is seen. Fewer amino acid preferences were found in all parallel β-bulges, and in antiparallel Bent β-bulges.
Table 4. Amino Acid and Protein Block Preferences for β-Bulges
Amino acid preferences
Protein block preferences
For each β-bulge type and for all β-bulge residue positions (X, 1, 2, 3, and 4), the over-represented amino acid and protein block are specified. The data has been normalized with respect to the amino acid and protein block frequencies in the β-sheets. These over-represented amino acids and PBs correspond to positive Z-scores greater than the threshold value of 4.42. The hatched cases correspond to undefined residue positions. The bold and underlined labels indicate new preferences compared to the previous study (Chan et al., 1993).
D, G, N
D, G, N
I, L, R, V, W
D, I, K, L
A, D, E, G, H, K, R, S
c, d, g, j
a, f, g
G, N, P
G, P, S
D, E, G, N, P
D, E, G, N, P
D, E, G, N, P
b, f, h, j
b, i, k
b, i, k
b, i, k
D, K, N
e, f, g
b, h, k
b, h, k
b, h, k
G, N, S, T, W
D, E, G, N, P
D, G, N, P
b, h, i, j
a, b, i
D, E, G
b, d, g, i, j
C, D, N
D, G, N
D, E, K, N, Q, R, S, T
Previous work on β-bulges mainly focused on three β-bulge types, that is, antiparallel Classic, G1, and antiparallel Wide [see Tables 4-6 of Chan's study]. About 85, 69 and 30 times more β-bulges of each type, respectively, are identified in this work, leading to a new characterization (see bold and underlined labels in Table 4). For the antiparallel Classic β-bulge, four new over-represented residues are observed: Leucine at position X, Asparagine and Lysine at position 1, and Arginine at position 2. However, four over-represented residues reported earlier showed a weaker preference in this study: Tyrosine at position X and Valine, Glutamate, and Serine at position 1 (see Supporting Information Table S5). In the case of antiparallel G1 β-bulges, the striking differences observed were (a) weaker preferences for Histidine and Serine at position X, (b) over-representation of Aspartate at position 1 and (c) enrichment of three residues at position 2, that is, Serine, Threonine, and Asparagine, which are otherwise common in all β-bulge types.
Finally, the antiparallel Wide β-bulges showed preference for Glycine and Asparagine at positions X and 1, and Serine and Tryptophan at position X. Position 2 had two new over-represented amino acids, namely Glycine and Proline, apart from Aspartate and Asparagine. Another major difference is the absence of preference for Glutamine.
Local conformation preferences of β-bulge
We also investigated the preference for local backbone conformations associated with β-bulges. Protein Blocks are currently the most widely used structural alphabet and give a finer description of local backbone conformations than the classical secondary structures. PBs b and i are found strongly over-represented (21- and 14-fold, respectively) in β-bulges. PB b is found at the N-cap of β-strands and PB i is frequent in loop regions. These two PBs are found in successive positions and strongly mark the irregularity in the β-strand. These PB conformations are not easily altered, as seen in the PB substitution matrix developed in our earlier work.[30, 31] Interestingly, similar Protein Blocks are found over-represented for both antiparallel (majority) and parallel (minority) types. Residue X is mainly in β-strand conformation, as reflected by the preferences for PB d, which corresponds to the central region of a classical β-strand, PBs b and c, which correspond to the N-cap, and PB f, which corresponds to the C-cap. In our previous studies, we have shown that certain preferential transitions are observed between PBs,[36, 37] that is, PBs (letters) have preferred successions (words). In G1 β-bulges, PBs i or p at position 1 are seen to be followed by PB a at position 2.
Antiparallel Classic β-bulges were characterized by distinct preferences for amino acids. They also have distinct PBs' patterns with PBs d or h favored at position X. PB h is mainly associated with residues following the end of a β-strand while PB a at position 1 and PBs g and j at position 2 are mainly loop associated PBs. AC β-bulges are mainly found within β-strands (∼75% see Table 3) with a preference to strand termini, an assignment in agreement with the observed PBs, mainly associated to loops.
Analysis of β-bulges in specific protein families, for example, the WD40 family and the immunoglobulin family, has suggested that β-bulges could be more conserved than other parts of protein structures. In total, 950,793 structure superimpositions were carried out using the iPBA program. Proteins placed together in the same fold category may not have a common evolutionary origin: the structural similarities could just arise from the physico-chemical properties of proteins favoring certain packing arrangements and chain topologies. The average GDT_TS score is 33.25 with a peak at 31 (see Supporting Information Fig. S2). Even though superimpositions were performed at the level of the SCOP fold, a non-negligible proportion of structural alignments share a very low GDT_TS, that is, some structures classified in the same SCOP fold cannot be properly superimposed. Hence, we selected only superimpositions with a GDT_TS score better than 15, a threshold already used in a previous study, corresponding to an average RMSD lower than 2.69 Å (see Supporting Information Fig. S3) and reflecting superimpositions of structures sharing a similar global conformation. Consequently, 716,346 superimpositions were selected.
Similarly, a strong correlation exists between GDT_TS and sequence identity, as seen between GDT_TS and RMSD (see Supporting Information Fig. S3). The GDT_TS and the sequence identity have an exponential correlation. As the sequence is less conserved than the structures, we observed that a GDT_TS score of 60 corresponds to a sequence identity of about 35% on an average. We observed 1,665,200 non-superimposed β-bulges and 531,567 aligned β-bulges (full and partial). As certain folds contained more experimental protein structures, the distribution of β-bulges was normalized.
Are the β-bulges conserved among homologous structures?
On average, a β-bulge has a 42% chance of being conserved (superimposed: 30% fully and 12% partially). In the majority of cases, the two superimposed β-bulges share the same type and secondary structure localization. A β-bulge has a 33.0% probability of being conserved with the same type, 31.4% with the same secondary structure localization, and 27.4% with both (see Table 5). Conservation of a β-bulge with a change of type (9%) was observed five times more often for partial than for full superimposition (probabilities of 7.5 and 1.5%, respectively). This result emphasizes that changes in the hydrogen bond pattern of a β-bulge go hand in hand with modification of the local structural conformation. Such modification often changes the β-bulge type, and the superimposition in such cases can only be partial.
Table 5. Conservation of β-Bulge Type and Localization of Superimposed β-Bulges
Different β-sheet localization
Same β-sheet localization
The frequencies of β-bulge superimpositions corresponding to the case of conservation of β-bulge type and conservation of β-sheet localization (In, Out, and In and Out), are given.
Superimposition of different type β-bulges
Superimposition of same type β-bulges
In the following sections, the residues that form the bulge (positions 1–4) are named the bulged residues. A β-bulge has only a 13.2% probability of being structurally conserved with the same X-residue, and only a 5.1% chance with the same bulged residues. For different β-bulge types (except Bent) at equivalent positions, the probability of being structurally conserved with the same X-residue decreases to 0.71%, and to 0.11% for the same bulged residues. Hence, the amino acid composition of a conserved β-bulge in homologous structures is not necessarily similar. However, conserved β-bulges with the same bulged amino acid sequence have a 91% probability of sharing the same β-bulge type. Figure 1(a) shows the conservation of bulged residues with regard to the whole sequence. It clearly shows that β-bulges are not more conserved (on average) than any other region of the proteins. As expected [Fig. 1(b)], sequence identity variation is correlated with GDT_TS, indicating no generic constraint specific to β-bulges. This result does not contradict the specific studies that underline the conservation of β-bulges, as we see better structural conservation of β-bulges for alignments with a higher GDT_TS score and a higher sequence identity. The probability of observing a conserved β-bulge in homologous structures increases from 38.48% (for sequence identity lower than 35%) to 75.98% (for higher sequence identity), and from 29.81% (for GDT_TS between 15 and 40) to 63.29% (for higher GDT_TS, see Table 6).
Table 6. Influence of the sequence and structural homology of structures
The frequencies of β-bulge superimpositions (Full, Partial, and No superimposition) for two sequence identities ranges, and two ranges of GDT_TS are given. These frequencies represent the influence of the sequence homology, and structural homology on the β-bulge conservation. Details on β-bulges type conservation are also given for each value.
Sequence identity ≤ 35%
Same type β-bulges
Different type β-bulges
Sequence identity > 35%
Same type β-bulges
Different type β-bulges
Same type β-bulges
Different type β-bulges
Same type β-bulges
Different type β-bulges
Stability of β-bulges
To understand the structural significance and stability of β-bulges, we studied the particular case of α-lytic protease (PDB code 1SSX) using Molecular Dynamics (MD) simulations. This protein is 198 residues long and is characterized by a large number of β-bulges, about 15 [see Fig. 2(b)]. Three different temperatures were applied during the MD simulations to analyze the stability of β-bulges. Thirty-four different β-bulges have been observed in the different simulations (see Supporting Information Table S6). The 15 β-bulges remain stable and are present at least 90% of the time at the three temperatures. Nonetheless, formation and disappearance of β-bulges (see Supporting Information Fig. S4) and changes in their types were observed. These are mainly due to significant local conformational changes that lead to the loss or gain of hydrogen bonds. Some β-bulges were quite transient (seen <29% of the simulation time) at all temperatures. These transient β-bulges are mainly composed of residues that are also observed in stable β-bulges. This could be due to the local structural environment that alters the protein flexibility, resulting in the β-bulge shift. For example, the two β-bulges found at positions 44–51-52 and 45–50-51 correspond to a stable and a transient β-bulge, respectively, with overlapping positions in the sequence. Interestingly, these putative displacements are accompanied by a change of β-bulge type.
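The stable versus transient distinction can be illustrated with a small occupancy calculation over trajectory snapshots. This is only a sketch under stated assumptions: the per-frame bulge sets are taken as already assigned (for example by PROMOTIF on each snapshot), the function names and the "intermediate" label are hypothetical, and the toy trajectory is invented. Only the two thresholds (at least 90% and below 29% of the simulation time) and the two example bulge positions come from the text.

```python
from collections import Counter


def bulge_occupancy(frames):
    """Fraction of snapshots in which each bulge is present.

    frames: list of sets of bulge identifiers, one set per snapshot.
    """
    counts = Counter()
    for bulges in frames:
        counts.update(set(bulges))
    return {bulge: n / len(frames) for bulge, n in counts.items()}


def classify_bulges(occupancy, stable_min=0.90, transient_max=0.29):
    """Label bulges using the occupancy thresholds quoted in the text."""
    labels = {}
    for bulge, fraction in occupancy.items():
        if fraction >= stable_min:
            labels[bulge] = "stable"
        elif fraction < transient_max:
            labels[bulge] = "transient"
        else:
            labels[bulge] = "intermediate"  # label assumed for this sketch
    return labels


# Toy trajectory of 10 snapshots; bulges are (X, 1, 2) residue positions
stable = (44, 51, 52)
transient = (45, 50, 51)
frames = [{stable}] * 9 + [{transient}]

occupancy = bulge_occupancy(frames)
print(classify_bulges(occupancy))
```

In this toy trajectory the first bulge is present in 9 of 10 snapshots and the overlapping one in a single snapshot, mimicking a shift of the bulge along the sequence.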
This study is based on the analysis of a larger dataset of β-bulges compared to previous works. We provide a new description of amino acid preferences [see Table 4 and Supporting Information Table S5(a)] associated with β-bulges. Various studies on specific protein families have shown the importance of conserved β-bulges.[40-42] Higher conservation of β-bulges was observed with the increasing degree of homology in terms of both sequence and structure.
However, our results highlight that β-bulges in structurally similar proteins are not necessarily conserved. We found that a β-bulge has only a 42% probability of being conserved in structures sharing the same fold. We also showed that different types of β-bulges are conserved among similar structures and that transition between β-bulge types is more frequent than expected (around 1 in 10 β-bulges). Nonetheless, we do not observe any significant preference for a certain type to be conserved [see Fig. 1(a) and Supporting Information Table S7]. For instance, different numbers of β-bulges were observed in two structures corresponding to the matrix protein VP40 of Ebola virus[43, 44] [see Fig. 2(a)]. These structures (SCOP domains d1es6a1 and d1h2ca_) share the same amino acid sequence and are very close in terms of global conformation (GDT_TS = 46.62 and RMSD = 2.06 Å). Three β-bulges were found conserved between both structures and one was missing in one structure [see Fig. 2(a)]. This observation highlights the possible effects of protein flexibility or crystal packing on β-bulge formation.
Finally, this work highlights that the conservation of β-bulges is not significantly influenced by conservation of the fold, contrary to the speculations in previous studies. However, the higher the homology in terms of sequence and structure, the higher the probability of finding conserved β-bulges. Molecular dynamics studies on a bulged protein allow the observation of the stable and transient nature of β-bulges. This behavior needs to be investigated in greater detail on related proteins to quantify its impact on conservation studies.
Materials and Methods
Overview of the method
Figure 3 shows the workflow for the analysis of β-bulges. Protein structures were taken from the ASTRAL SCOP dataset [Fig. 3(a)]. Secondary structures, including β-bulges, were assigned by the PROMOTIF software [Fig. 3(b)]. Protein structural domains containing at least one β-bulge were superimposed with protein structural domains belonging to the same SCOP fold. Pairwise structural alignments were performed using iPBA [Fig. 3(c)]. The conservation of β-bulges at structurally equivalent positions was then analyzed [Fig. 3(d)].
Two sets of protein structures were extracted from Protein Data Bank based on the ASTRAL SCOP dataset, filtered at 40% and 95% sequence identity. The proteins were classified into folds and classes based on the SCOP classification. All NMR structures were excluded from the analysis. SCOP95 dataset contained 16,712 structures representing 1,195 folds and 7 classes.
Analysis of local backbone conformation
Secondary structures have been assigned using the PROMOTIF software; it is based on the DSSP methodology and uses backbone hydrogen bond patterns. PROMOTIF also gives assignments of different types of turns and β-bulges. The assignment of β-bulges is based on the Chan et al. classification, which defines five main types: Classic (C), Bent (B), Wide (W), G1, and Special (S). PROMOTIF is used for the secondary structure assignment in PDBsum and is the only currently available software that allows distinguishing and assigning the different types of β-bulges.
Protein Blocks (PBs) were also used to obtain a finer and different view of the local backbone conformation. They correspond to a set of 16 pentapeptide conformations, labeled from a to p, described as a series of (φ,ψ) dihedral angles[29, 51] (see Supporting Information Table S8). This library was obtained by clustering all pentapeptide conformations using an unsupervised classifier similar to Kohonen Maps[52, 53] and Hidden Markov Models. PBs m and d can be roughly described as prototypes for the central region of the α-helix and the β-strand, respectively. PBs a, b, and c primarily represent the N-cap of the β-strand, while PBs e and f correspond to its C-cap. PBs g to j are specific to coils. PBs k and l correspond to the N-cap of the α-helix, while PBs n to p are associated with its C-cap. This structural alphabet of 16 prototypes allows a reasonable approximation of local protein 3D structures, with an average root mean square deviation (RMSD) of about 0.42 Å.
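The assignment of a prototype to a pentapeptide can be sketched as a nearest-prototype search in dihedral-angle space. The sketch below assumes that each five-residue window is described by its 8 internal dihedral angles and that the closest prototype is the one with the smallest root mean square deviation on angular values; the two prototypes used here are illustrative placeholders with typical helix-like and strand-like angles, not the published PB values (those are listed in Supporting Information Table S8), and all function names are hypothetical.

```python
import math


def angle_diff(a, b):
    """Signed difference a - b in degrees, wrapped into [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0


def rmsda(v1, v2):
    """Root mean square deviation between two equal-length dihedral vectors."""
    return math.sqrt(sum(angle_diff(a, b) ** 2 for a, b in zip(v1, v2)) / len(v1))


def assign_block(window, prototypes):
    """Name of the prototype closest to the 8-angle window."""
    return min(prototypes, key=lambda name: rmsda(window, prototypes[name]))


# Placeholder prototypes (psi, phi alternating), NOT the real PB definitions
toy_prototypes = {
    "helix_like": [-45.0, -60.0] * 4,
    "strand_like": [130.0, -120.0] * 4,
}

window = [135.0, -115.0, 128.0, -125.0, 140.0, -110.0, 125.0, -118.0]
print(assign_block(window, toy_prototypes))
```

Wrapping the angular difference matters because dihedral angles are periodic: 179° and −179° differ by 2°, not 358°.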
Abstraction of structures in terms of PBs helps to encode 3D information into a 1D sequence.[29, 37, 51] We used classical amino acid sequence alignment strategies to align PB sequences and thereby compare protein structures.[55-57] The alignment approach was refined with an anchor-based dynamic programming algorithm, which first identifies all high-scoring and structurally favorable local alignments (anchors). The segments between these anchors are then aligned to obtain a global alignment. This improved PB-based structure alignment approach, namely iPBA, outperformed other established methods as seen with different robust benchmark datasets.[30, 31] ProFit (version 3.1) was used to obtain the final 3D superimposition of two protein structures (based on the PB-based sequence alignment). ProFit performs a least squares fit of protein structures based on the residue equivalences in a given sequence alignment. Only Cα atoms of equivalent residues were used for the calculations.
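Because PB sequences are plain strings over a 16-letter alphabet, standard dynamic programming applies directly. The following sketch shows a basic global (Needleman-Wunsch) alignment of two PB strings. It is not the iPBA algorithm: there is no anchor detection, and the identity scoring function and gap penalty are toy assumptions standing in for the PB substitution matrix.

```python
def identity_score(a, b):
    """Toy scoring: +2 for identical PBs, -1 otherwise (not the PB matrix)."""
    return 2 if a == b else -1


def global_align(seq_a, seq_b, score=identity_score, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty."""
    n, m = len(seq_a), len(seq_b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]),
                dp[i - 1][j] + gap,
                dp[i][j - 1] + gap,
            )
    # Trace back from the bottom-right corner to recover one optimal alignment
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]):
            out_a.append(seq_a[i - 1])
            out_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + gap:
            out_a.append(seq_a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[j - 1])
            j -= 1
    return dp[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


# Two hypothetical PB strings differing by one strand-core block
total, aligned_a, aligned_b = global_align("ddddfklmmm", "dddfklmmm")
print(total)
print(aligned_a)
print(aligned_b)
```

The residue equivalences read off such an alignment are what a least squares fitting step would then use to superimpose the Cα atoms.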
iPBA provides two measures of the structural superimposition: the RMSD and the global distance test total score (GDT_TS, defined by Zemla in 2003). The latter varies between 0 and 100 and reflects the global similarity of two protein structures. It was used for model assessment in the last rounds of the Critical Assessment of Techniques for Protein Structure Prediction (CASP).
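To make the 0 to 100 scale concrete, the sketch below computes a simplified GDT_TS-like score as the average, over the distance cutoffs 1, 2, 4 and 8 Å, of the percentage of equivalent Cα pairs lying within the cutoff. It assumes that the two coordinate lists are already superimposed and paired residue by residue; the full procedure searches over many superpositions to maximize each percentage, which is not reproduced here. The function name and the coordinates are hypothetical.

```python
import math


def gdt_ts_fixed(coords_a, coords_b, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS for paired Cα coordinates in a fixed superposition."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    distances = [math.dist(p, q) for p, q in zip(coords_a, coords_b)]
    fractions = [
        sum(d <= cutoff for d in distances) / len(distances)
        for cutoff in cutoffs
    ]
    return 100.0 * sum(fractions) / len(cutoffs)


# Toy coordinates: four Cα pairs displaced by 0.5, 1.5, 3.0 and 10.0 Å
coords_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
coords_b = [(0.0, 0.5, 0.0), (3.8, 1.5, 0.0), (7.6, 3.0, 0.0), (11.4, 10.0, 0.0)]
print(gdt_ts_fixed(coords_a, coords_b))
```

Here 1, 2, 3 and 3 of the 4 pairs fall within the successive cutoffs, giving (25 + 50 + 75 + 75) / 4 = 56.25.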
Amino acid sequence alignment was performed using CLUSTAL-W (version 2.1). Default parameters were used with Gonnet substitution matrix.
Characterization of β-bulge superimposition
Proteins classified into the same fold were superimposed with iPBA. The structure based sequence alignment generated by iPBA was used to evaluate the structural conservation of β-bulges. Based on this alignment output, we can distinguish three cases of β-bulges superimposition (Fig. 4):
Full Superimposition: all the residues of the first β-bulge are aligned with those of the second β-bulge [see Fig. 4(a)]; the number of aligned residues corresponds to the number of residues of the shorter β-bulge.
Partial Superimposition: only some residues of the smaller β-bulge are aligned with the residues of the other β-bulge [see Fig. 4(b)].
Non Superimposed β-bulges: A β-bulge found in one protein structure is not aligned with a β-bulge on the other protein [see Fig. 4(c)].
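The three cases above can be expressed as a small classification function. This is a sketch under assumptions: the structure-based alignment is represented as a one-to-one dictionary mapping residue numbers of the first protein to residue numbers of the second, each bulge is a set of residue numbers, and the function name and toy numbers are hypothetical.

```python
def classify_superimposition(bulge_a, bulge_b, alignment):
    """Return 'full', 'partial' or 'none' for a pair of β-bulges.

    bulge_a, bulge_b: sets of residue numbers in proteins A and B.
    alignment: dict mapping aligned residues of A to residues of B.
    """
    matched = sum(1 for res in bulge_a if alignment.get(res) in bulge_b)
    shorter = min(len(bulge_a), len(bulge_b))
    if matched == 0:
        return "none"
    if matched >= shorter:
        return "full"
    return "partial"


# Toy residue numbers (not from the dataset)
bulge_a = {10, 11, 12}
bulge_b = {20, 21, 22}

in_register = {10: 20, 11: 21, 12: 22}
shifted = {10: 19, 11: 20, 12: 21}
elsewhere = {10: 30, 11: 31, 12: 32}

print(classify_superimposition(bulge_a, bulge_b, in_register))
print(classify_superimposition(bulge_a, bulge_b, shifted))
print(classify_superimposition(bulge_a, bulge_b, elsewhere))
```

The three toy alignments illustrate a full, a partial and an absent superimposition, respectively.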
Amino acid and PB composition
Occurrences of each amino acid and Protein Block in a particular β-bulge type have been normalized into a Z-score:
$$Z(n_{i,j}) = \frac{n_{i,j}^{obs} - n_{i,j}^{th}}{\sqrt{n_{i,j}^{th}}}$$
with $n_{i,j}^{obs}$ the observed occurrence number of amino acid or PB i in position j (residue position X, 1, 2, 3, or 4) for a given β-bulge type and $n_{i,j}^{th}$ the expected number. The expected values correspond to the product of the occurrence in position j and the frequency of amino acid i in the entire databank (or in the β-strands of the databank). Positive Z-scores correspond to over-represented amino acids or PBs, and the threshold values of 4.42 and 1.96 were chosen to indicate the level of significance (P-value < 10⁻⁵ and 5 × 10⁻², respectively). This measure was used in our previous studies to analyze the amino acid representation in Protein Blocks.[37, 63]
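The normalization can be illustrated numerically. The sketch below assumes the usual form of this Z-score, that is, the difference between observed and expected counts divided by the square root of the expected count, with the expected count taken as the total number of occurrences at the position multiplied by the background frequency. The function names and the counts are invented for illustration.

```python
import math


def z_score(observed, position_total, background_freq):
    """Z-score of an observed count against its expected count."""
    expected = position_total * background_freq
    if expected <= 0:
        raise ValueError("expected count must be positive")
    return (observed - expected) / math.sqrt(expected)


def over_represented(counts, background, threshold=4.42):
    """Symbols whose Z-score at one bulge position exceeds the threshold.

    counts: observed count of each amino acid or PB at this position.
    background: frequency of each symbol in the reference set
    (for example, all β-strand residues).
    """
    total = sum(counts.values())
    hits = {}
    for symbol, n_obs in counts.items():
        z = z_score(n_obs, total, background[symbol])
        if z > threshold:
            hits[symbol] = z
    return hits


# Toy counts and frequencies, not taken from the study
counts = {"G": 300, "V": 100, "A": 600}
background = {"G": 0.1, "V": 0.3, "A": 0.6}
print(over_represented(counts, background))
```

With 1000 observations, Glycine is expected 100 times but seen 300 times, giving Z = 200 / 10 = 20, well above the 4.42 threshold; Valine is under-represented and Alanine matches its background frequency.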
MD simulations were performed with GROMACS 4.5.4[64-67] using the Amber 03 force field for proteins and the explicit TIP3P solvent model for water molecules. The structure was immersed in a water box with periodic boundary conditions and neutralized with Na+ or Cl− counterions. The energy of each system was then minimized with a steepest-descent algorithm for 2000 steps. MD simulations were performed in the NPT ensemble, with temperature and pressure kept constant using the Berendsen algorithm, at three different temperatures (298, 310, and 353 K) and 1 bar pressure. The coupling time constants were τ = 0.1 ps and τ = 4 ps for temperature and pressure, respectively. Bond lengths were constrained with the LINCS algorithm, which allowed an integration step of 2 fs. The Particle-mesh Ewald summation was used to handle long-range electrostatic interactions, with a cut-off of 1.4 nm for nonbonded interactions. An equilibration step was first performed for 500 ps, with protein atom positions restrained while ions and water molecules were free to move, followed by an unrestrained production step of 50 ns. The coordinates were recorded every 10 ps. The MD simulations were checked and analyzed using GROMACS tools.
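The run lengths quoted above translate into integer step counts, which is how such settings are usually specified to an MD engine. The helper below is a hypothetical convenience function written for this illustration (it is not part of GROMACS); it only performs the unit conversion from the stated durations and the 2 fs time step.

```python
def md_schedule(length_ps, timestep_fs, save_interval_ps):
    """Convert a run length and output interval into integer step counts."""
    length_fs = length_ps * 1000
    save_fs = save_interval_ps * 1000
    if length_fs % timestep_fs or save_fs % timestep_fs:
        raise ValueError("durations must be whole multiples of the time step")
    n_steps = length_fs // timestep_fs
    save_every = save_fs // timestep_fs
    return {
        "n_steps": n_steps,
        "save_every": save_every,
        # Frames written after the starting structure
        "n_saved_frames": n_steps // save_every,
    }


# Production run: 50 ns with a 2 fs step, coordinates saved every 10 ps
production = md_schedule(length_ps=50_000, timestep_fs=2, save_interval_ps=10)
# Equilibration: 500 ps with the same step (only n_steps is used here)
equilibration = md_schedule(length_ps=500, timestep_fs=2, save_interval_ps=10)

print(production)
print(equilibration["n_steps"])
```

The 50 ns production run therefore corresponds to 25,000,000 integration steps with coordinates written every 5,000 steps, that is, 5,000 saved frames per temperature, and the 500 ps equilibration to 250,000 steps.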