Large-scale analysis of somatic hypermutations in antibodies reveals which structural regions, positions and amino acids are modified to improve affinity



The principles of affinity maturation of antibodies (Abs), which underlies B cell-mediated immunity, are still under debate. It is unclear whether the antigen (Ag) binding site is a preferred target for mutations, and what the role of activation-induced deaminase (AID) hotspots is in this process. Here we report a structural analysis of 3495 residues that have been replaced through somatic hypermutations (SHMs) in 196 Abs. We show that there is no correlation between the propensity of an amino acid to be in an AID hotspot and the probability that it is replaced during the SHM process. Although AID hotspots may be necessary to enable SHMs, they are not a major driving force in determining which residues are mutated. We identified Ab positions that are highly mutated and significantly affect binding. The effect of mutation on binding energy is a major factor in determining which structural regions of the Ab are mutated. There is a clear preference for mutations at the Ag-binding site. However, positions outside this region that also affect binding are often preferred targets for SHMs. As for amino acid preferences, a general trend during SHM is to make Ab–Ag interfaces more similar to protein–protein interfaces in general. In different regions of the Ab, there are different sets of preferences for amino acid substitution. This mapping improves our understanding of Ab affinity maturation and may assist in Ab engineering.




ABR, antigen-binding region

CDR, complementarity-determining region

SHM, somatic hypermutation

VH, heavy chain variable domain

VL, light chain variable domain


B cells are activated during exposure to pathogens, and produce antibodies (Abs) that bind specific antigens (Ags). The initial repertoire of germline Abs is generated by rearrangement of the V(D)J gene segments [1]. These Abs are the first responders to the Ag, and are believed to bind Ag with low affinity [2]. Improvement of affinity occurs in the days after the initial exposure through introduction of high-rate base changes in the Ab sequence, known as somatic hypermutations (SHMs), and selection of B-cell clones that have better affinity toward the Ag [3]. The SHM process enables development of an efficient secondary response and immunological memory, which is key to development of B-cell immunity. Investigating SHMs is therefore essential for understanding the immune system. It may also guide Ab engineering, thus improving development of Abs as research, diagnostic and therapeutic agents.

The molecular mechanisms that underlie SHMs have been the focus of extensive research. The introduced mutations are predominantly point mutations and rarely base insertions or deletions [4, 5], and are mediated by the activation-induced deaminase (AID) enzyme [6, 7]. AID introduces diversity by converting cytosine to uracil, which activates error-prone DNA repair mechanisms [6, 8, 9]. Cytosines located within DNA motifs that are preferred binding targets of the AID enzyme are commonly referred to as hotspots [10]. However, not all of the hotspots are targeted [11], and many SHMs occur near hotspots but not within them [12]. The assumption that AID plays an important role in the SHM process inspired attempts to utilize it in vitro, e.g. by coupling mammalian cell-surface display with AID-directed SHM [13] or by designing phage display libraries based on DNA hotspots [14].

Studies that have attempted to characterize SHMs structurally mostly involved analyses of the crystal structures of one or a few pairs of germline and mature variants of a specific Ab in order to determine how structural factors affect affinity enhancement. In one such study, examination of the X-ray crystal structures of four anti-lysozyme Ab variants at various maturation stages revealed that binding is enhanced by burial of increasing amounts of apolar surface area and by improving shape complementarity [15]. However, analysis of another set of Abs found that the mature Ab does not have better shape complementarity to the Ag than its germline variant, but exhibits a small improvement in shape complementarity between the variable light (VL) chain and the variable heavy (VH) chain, and has a higher electrostatic contribution to Ag binding than that of the germline Ab [16]. The X-ray structure of an anti-hapten Ab and its corresponding germline Ab suggested that, in this case, the increased affinity is achieved mainly by electrostatic optimization [17]. Several studies used molecular dynamics simulations of a handful of mature Abs [18] or a specific Ab lineage [19, 20], and reported that rigidification of the paratope leads to a reduction in the entropic cost of the interaction.

The studies that have examined whether SHMs are focused on residues involved in Ag binding reached contradictory conclusions. Clark et al. [12] identified SHMs in over 11 000 Ab sequences. They reported that Ag-contacting positions are mutated three times more often than core residues. However, in this analysis, interface positions in the Ab sequence were defined as Ab positions that are within 12 Å of an Ag atom in any PDB structure, a definition that covers mostly residues that do not physically interact with the Ag. SHMs and hotspots were reported to be over-represented in the complementarity-determining regions (CDRs) [12, 21]. However, while CDRs cover ~ 80% of the Ag-binding residues, 50–60% of the residues in the CDRs do not contact the Ag [22]. Several studies indicated that SHMs mostly occur in the periphery of the germline Ag-binding site and not in its center [23, 24], and that SHMs do not show a clear preference toward residues that are in contact with the Ag [25, 26]. It has even been suggested that mutations in the interface may be disfavored as they disrupt Ab–Ag interaction [25, 27].

In this study, we wished to determine the principles that guide in vivo Ab affinity maturation. In particular, we attempted to identify factors that determine which residues are removed and which new ones are introduced during the SHM process. Given the controversies regarding the tendency of the paratope to undergo SHM, we attempted to determine whether different structural parts of the Abs have different tendencies for substitutions. To this end, we analyzed 3495 SHMs in 196 structurally characterized Ab–Ag complexes, and examined (a) the role of AID hotspots in directing mutations, (b) the selective pressure for substitutions in different structural regions of the Ab, and (c) the predicted energetic effect of each substitution. We found that AID motifs have no effect on selection of mutated residues, but the energetic contribution to Ag binding appears to have a major effect. Finally, we produced a map of the preferred substitutions in each region of the Ab. These results contribute to understanding of the principles that govern the SHM process, and may guide the design and engineering of high-affinity Abs.


Dataset construction and SHM identification

A non-redundant dataset of 196 Ab–Ag complexes was generated (Table S1). Overall, 3495 SHMs were identified in the variable regions. Of those, 2172 occurred in mouse sequences (with a mean of 14.87 mutations per Ab) and 1323 occurred in human sequences (with a mean of 26.46 mutations per Ab). This difference may be ascribed, at least in part, to the way Abs are collected from mice and humans. The former are typically killed, and Abs collected, shortly after exposure to the Ag when they are a few months old. Human Abs, on the other hand, are typically collected from the blood of infected adults after repeated exposures to Ags.

AID hotspot motifs are not correlated with SHMs

As only the amino acid sequences of the mature Abs are available in the Protein Data Bank, it is impossible in most cases to retrieve the DNA sequences of the mature Ab from public databases. However, it is possible to retrieve the DNA sequences of the germline genes. These sequences allow us to evaluate the relationships between SHMs and AID hotspot motifs (RGYW or WRCY; R indicates a purine base, Y indicates a pyrimidine base, W indicates an A or a T base) [10] in the germline genes.
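The motif definitions above translate directly into pattern matching. The following is a minimal sketch, not the authors' pipeline: the function names and the choice to report half-open (start, end) base ranges are illustrative assumptions, and only the RGYW and WRCY definitions come from the text.

```python
import re

# Degenerate bases as defined in the text: R = purine, Y = pyrimidine, W = A or T.
IUPAC = {"R": "[AG]", "Y": "[CT]", "W": "[AT]"}


def motif_to_regex(motif):
    """Translate a degenerate motif such as 'RGYW' into a regular expression."""
    return "".join(IUPAC.get(base, base) for base in motif)


def find_hotspots(dna, motifs=("RGYW", "WRCY")):
    """Return sorted half-open (start, end) spans of every motif occurrence.

    Hypothetical helper: overlapping occurrences are all reported, and a site
    matching both motifs is listed once.
    """
    dna = dna.upper()
    spans = set()
    for motif in motifs:
        # A lookahead assertion lets overlapping matches be found.
        pattern = re.compile("(?=(" + motif_to_regex(motif) + "))")
        for match in pattern.finditer(dna):
            start = match.start(1)
            spans.add((start, start + len(motif)))
    return sorted(spans)
```

For instance, the palindromic site AGCT satisfies both RGYW and WRCY, so it appears as a single span in the output.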

Figure 1A shows how often a certain amino acid overlaps with an AID hotspot motif versus how often it is actually mutated during SHM. The calculated correlation coefficient is −0.0127, indicating that amino acids that overlap with hotspot motifs more often are not more likely to undergo SHM. This is most extreme in the case of methionine and aspartic acid, which are the least frequent amino acids in AID hotspots and yet are mutated more often than they overlap with AID sites. We also mapped the location of mutations in the V gene to their positions in the germline genes. Then, we calculated the distance of each mutation and each residue in all Ab V genes from the nearest hotspot motif. This was previously performed by Clark et al. [12] for a set of ~ 11 000 Ab sequences. Figure 1B shows the distribution of mutations at different distances from hotspots. Our results are very similar to the previously published results [12]. Based on these results, it has been previously suggested that mutations are more likely to occur in positions that are located closer to a hotspot motif. However, we added a control to this analysis by checking the distance of codons from the nearest hotspot motif for residues that were not necessarily substituted in SHMs. We found that the typical distance of a residue from a hotspot is very similar whether it has been mutated during SHM or not, suggesting that the distribution of hotspots along the sequence is such that any codon encoding an amino acid is more likely to be located near or within a hotspot than to be distant from one. However, Fig. 1B shows that position 0 has a slightly higher value for residues that underwent SHM (gray line) compared to other residues (black line), indicating that residues that have been mutated are slightly more likely to have codons that overlap with an AID hotspot. This slight preference explains only a negligible proportion of the SHMs: 13% of residues that have been mutated overlap with AID hotspots, compared to 9% of all residues. This observation suggests that hotspot motifs may be viewed as an enabler of SHMs, but that other factors are involved in determining which mutations survive clonal selection.
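The distance comparison between mutated residues and all residues can be sketched as follows. This is an illustration under stated assumptions rather than the published procedure: it assumes that distance is measured in bases from the middle base of a codon, that a codon whose middle base lies inside a motif has distance 0, and that hotspots are supplied as half-open (start, end) base ranges. The function names are hypothetical.

```python
def distance_to_nearest_hotspot(codon_index, hotspot_spans):
    """Distance in bases from a codon's middle base to the nearest hotspot.

    codon_index is 0-based. Returns 0 if the middle base lies inside a
    hotspot and None if no hotspots are given (assumed conventions).
    """
    middle = 3 * codon_index + 1
    best = None
    for start, end in hotspot_spans:
        if start <= middle < end:
            return 0
        # Gap to the closest base of this hotspot.
        gap = start - middle if middle < start else middle - (end - 1)
        if best is None or gap < best:
            best = gap
    return best


def distance_histogram(codon_indices, hotspot_spans):
    """Count codons at each distance, i.e. a histogram with bins 1 base wide."""
    counts = {}
    for index in codon_indices:
        distance = distance_to_nearest_hotspot(index, hotspot_spans)
        counts[distance] = counts.get(distance, 0) + 1
    return counts
```

Running `distance_histogram` once on the mutated codons and once on all codons of the same germline genes gives the two curves whose comparison serves as the control described above.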

Figure 1.

Hotspot motifs and occurrence of SHMs. (A) The fraction of the germline occurrences of an amino acid that are mutated during SHM versus the fraction of its occurrences that fall within a DNA hotspot motif is shown in a scatter plot. The y value for each data point is the proportion of a specific amino acid in a hotspot motif (RGYW or WRCY). The x value for each data point is the proportion of a specific amino acid that was mutated during the SHM process. The proportions in both cases are calculated relative to the total number of the specific amino acid in the germline sequences (V and J segments) of the 196 Abs in the dataset. (B) The distance of the middle codon from the nearest hotspot motif was calculated (see 'Experimental procedures') for each amino acid or mutation up to position 105 (according to IMGT numbering) in the V gene. The distances for mutations (gray line) or amino acids (black line) in the V gene are presented in a histogram using bins 1 base wide.

SHMs occur more in heavy chains, but light chain SHMs are as important energetically

We assessed the energetic effect on the binding of the Ag for every mutated residue in the Ab by mutating it back to its germline amino acid (in silico) and predicting the effect of this mutation on the ∆∆G of binding. The calculations were performed using FoldX [28], which uses parameters and weights derived from experimental data from a large number of mutations. Large-scale assessments of the energetic predictions by FoldX for 1030 mutants [29] have shown them to be strongly correlated (R = 0.83) with experimentally measured effects. Thus, while FoldX may not always provide accurate individual predictions, it may be trusted to reveal trends in large sets of mutations. Half (51%) of the SHMs had predicted ∆∆G values of 0, suggesting that they have no effect on binding, while 32% of the SHMs had positive ∆∆G values and only 17% had negative ∆∆G values, indicating that, as expected, mutating mature residues back to their germline amino acids hampers Ag binding more often than it improves it. The distribution of ∆∆G values for SHMs in the VH domain is almost identical to that of SHMs in the VL domain (Fig. S1). However, 63.3% of SHMs occur in the VH domain. As the size of both domains is virtually identical, we conclude that there is a preference for SHM in the heavy chain, but each individual mutation has a similar effect regardless of the chain in which it occurs.

The Ag combining site has the highest SHM propensity

We divided the Ab into five non-overlapping structural regions (Fig. 2A): (a) ‘Ag interface’, which includes residues that contact the Ag, (b) ‘VH–VL interface’, which includes residues on each chain that contact the other chain, (c) ‘both interfaces’, which includes residues that are implicated in both Ag and VH–VL interfaces, (d) antigen-binding region (ABR) residues that are not in contact with the Ag, and (e) ‘other residues’ (Experimental procedures). ABRs are stretches of the six hypervariable loops that roughly correspond to the CDRs [22, 30], but cover more of the Ab–Ag interface [22]. For each of the five regions, we predicted the energetic effect of each SHM on binding by mutating each SHM residue back to its germline amino acid. The strongest energetic effect was observed in residues in the Ag interface and in both interfaces (Fig. 2B). However, mutations to the VH–VL interface and mutations to the ABR residues that are not in interfaces still affect binding energy more than mutations in other areas of the Ab. Thus, although these mutations do not occur in the binding site per se, they contribute to the binding energy. We also assessed the propensity of SHMs in these five structural regions. First, we calculated the percentage of residues in each region out of the residues in the variable regions (Fig. 2C) and the percentage of SHMs (% mutations) in each region out of all SHMs (Fig. 2D). For a given region, the mutation propensity (Fig. 2E) was calculated as:

$$P_r = \log\left(\frac{\%\ \text{mutations in region } r}{\%\ \text{residues in region } r}\right)$$
Figure 2.

Mutation propensity and energy contribution of the various Ab structural regions. (A) The Ab residues were divided into their structural regions, as demonstrated by coloring of a representative structure of Ab 1F9 against turkey egg white lysozyme C. The image was generated from PDB ID 1DZB using Discovery Studio Visualizer (Accelrys, San Diego, CA). The Ag is shown as an orange ribbon. The Ab variable region is also shown in ribbon representation, colored according to the structural group: Ag interface in blue, VH–VL interface in red, both interfaces in green, ABRs that are not in interfaces in purple, and other residues in gray. (B) The ∆∆G values for substitution of each SHM back to its germline amino acid were calculated using FoldX (see 'Experimental procedures'). The ∆∆G values are presented in a histogram using bins 1 kcal·mol−1 wide, with each region colored as described above. (C,D) The percentage of residues in each region out of all residues in the variable regions (C), and the percentage of SHMs in each region out of all SHMs (D) are shown in pie charts. (E) The mutation propensities for the various regions were calculated as the log of the ratio of the percentage of mutations in the region to the percentage of amino acids in the region.

where ‘r’ represents one of the five structural regions. If there is no preference for mutations in one region, the value of Pr for that region is 0. This propensity may be used to assess the selective pressure on each of the structural regions defined. Consistent with previous reports [25, 26], we found that most of the mutations (71.63%) occur outside the Ag-binding site (Fig. 2D), with 18.55%, 13.75% and 39.33% of the mutations being introduced into the regions ‘VH–VL interface’, ‘ABRs not in interfaces’ and ‘other residues’, respectively. However, 87.75% of the Ab residues in the variable region do not contact the Ag. Thus, when normalizing to the relative sizes of these regions (Fig. 2C), we found that the strongest propensity for SHMs is in fact for the Ag interface and for residues in both interfaces. These regions account for 12.25% of the Ab residues but for 28.36% of the SHMs. For ABR residues that are not in interfaces, a lower but significant positive propensity is observed. The VH–VL interface has an SHM propensity slightly above zero. Two-fifths (39.3%) of the SHMs occur in ‘other residues’, which cover 59.8% of the Ab. Thus, there is a negative preference for SHMs in positions that are not in the first four regions defined above. The results in Fig. 2B,E imply that the propensity for SHM and the predicted energetic contribution are correlated, as a correlation coefficient of 0.8 was calculated between the mutation propensities and the mean ∆∆G values of SHMs in each region.
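As a worked illustration, the propensity (the log of the ratio of a region's share of SHMs to its share of residues) can be evaluated for the percentages quoted above. This is a sketch only: the base of the logarithm is not stated in the text, so the natural logarithm is assumed here, and the function name is hypothetical. The sign of the result does not depend on the base.

```python
import math


def mutation_propensity(pct_mutations, pct_residues):
    """Log ratio of a region's share of SHMs to its share of residues.

    Assumption: natural logarithm (the text does not state the base).
    """
    return math.log(pct_mutations / pct_residues)


# Percentages quoted in the text.
ag_binding = mutation_propensity(28.36, 12.25)  # Ag interface plus both interfaces
other = mutation_propensity(39.3, 59.8)         # 'other residues'

print(f"Ag-binding regions: {ag_binding:+.2f}")
print(f"other residues:     {other:+.2f}")
```

A region that receives mutations in exact proportion to its size gives a ratio of 1 and hence a propensity of 0; the Ag-binding regions come out positive and ‘other residues’ negative, matching the direction of the preferences reported in the paragraph.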

Germline residues account for most of the binding of the Ag

To determine which contacts contribute more to Ag binding, i.e. those that are formed by the residues mutated during SHM (‘SHM contacting residues’) or those that are formed by residues retained from the primary germline sequence (‘germline contacting residues’), we compared their predicted energetic contribution by mutating each contacting residue to alanine and calculating the effect of this mutation on binding energy (see 'Experimental procedures'). The results are shown in Fig. 3. Only 18% of the contacting residues in the mature Abs were the result of SHMs (Fig. 3A). However, the distribution of the energetic contribution of these residues is almost identical to that of germline residues that make contact with the Ag (Fig. 3B). We conclude that Ag binding is accounted for in large part by the germline Ab sequences. SHM may fine-tune this interaction by adding some contacts with similar energy distribution. It is possible that some SHMs also induce conformational changes that allow more germline residues to contact the Ag, thus improving affinity.

Figure 3.

SHM contacting residues, germline contacting residues and protein–protein interfaces. The data for SHM contacting residues are shown in dark gray (A–C), those for germline contacting residues are shown in light gray (A–C) and those for protein–protein interfaces are shown in white bars (C). (A) Binding site composition according to the origin of the contacting residues. (B) The ∆∆G values for substitution of each contacting residue by alanine were calculated using FoldX (see 'Experimental procedures'). The ∆∆G values are presented in a histogram using bins 1 kcal·mol−1 wide. (C) For the amino acid composition, the amino acids are listed on the x axis and the y values are the amino acid frequency in the contacting residues. Error bars represent the standard error. (D) The similarity between the amino acid compositions was calculated using Jensen–Shannon divergence.

SHMs make the Ab–Ag interface more similar to other protein–protein interfaces

We compared the amino acid composition of SHM contacts and germline contacts with those of general protein–protein interfaces. All aliphatic hydrophobic amino acids (alanine, isoleucine, leucine, methionine, proline and valine) are under-represented in the Ab–Ag interface compared with general interfaces (Fig. 3C). However, SHMs increase the representation of aliphatic residues in the interface compared to the germline. Tyrosine, serine and tryptophan were previously reported to be abundant in Ab paratopes [31, 32]. They are highly over-represented in the germline contacting residues (19.35%, 12.63% and 5.95%, respectively) but much less so in SHMs (5.53%, 8.18% and 0.71%, respectively) and in protein–protein interfaces (4.19%, 6.66% and 1.53%, respectively). Our results corroborate previous findings [12] showing that this over-representation is already encoded in the germline sequence. However, SHM slightly decreases this over-representation, bringing the mature interface composition closer to that of general protein–protein interfaces. Although the energy contribution of both types of Ag contacting residues is similar, their amino acid composition is remarkably different. Asparagine, phenylalanine and arginine are abundant in contacts arising during SHM, while tyrosine, serine and tryptophan are abundant in the germline contacts. We assessed the similarity between the amino acid compositions of these three types of interfaces using Jensen–Shannon divergence [33] (Fig. 3D). Samples that come from the same distribution have a Jensen–Shannon divergence that is close to 0, and the Jensen–Shannon divergence increases as the differences in the compared distributions increase. The largest Jensen–Shannon divergence was found between germline contacting residues and general protein–protein interface residues (0.117). The greatest similarity was found between protein–protein interfaces and SHM contacting residues, with a Jensen–Shannon divergence of 0.054, which is smaller than the Jensen–Shannon divergence between SHM contacting residues and germline contacting residues (0.077). Thus, although germline contacts differ substantially from general protein–protein interfaces, SHM contacts, which are more similar to general protein–protein interfaces, make the final composition of the mature Ab interface more similar to protein–protein interfaces, with a Jensen–Shannon divergence of 0.0973.
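For readers unfamiliar with the measure, the Jensen–Shannon divergence between two amino acid compositions can be computed as in the sketch below. The base of the logarithm used in the study is not stated; base 2 is assumed here, which bounds the divergence between 0 (identical compositions) and 1 (no shared amino acids). Function names are illustrative.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits; zero-probability terms contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence of two frequency vectors.

    Both vectors must list the same amino acids in the same order; they are
    normalized to sum to 1, so raw counts or percentages are accepted.
    """
    p_total, q_total = sum(p), sum(q)
    p = [x / p_total for x in p]
    q = [x / q_total for x in q]
    # The mixture m is nonzero wherever p or q is nonzero, so each KL term is finite.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

The measure is symmetric, so the order of the two compositions does not matter. Note that some libraries report the Jensen–Shannon distance, which is the square root of this divergence (SciPy's `scipy.spatial.distance.jensenshannon` is one example).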

Structure and function drive the propensity for mutation

To understand the role of different amino acids in SHM and the differences between the structural regions, we further analyzed the propensities for mutation in germline amino acids during SHM. As shown in Fig. 4, alanine and serine are mutated more than expected by chance across all structural regions, while glycine, proline and leucine are mutated less than expected. Alanine, methionine and valine are the only aliphatic hydrophobic amino acids that are mutated significantly more than expected by chance. This enrichment holds for valine only in the VH–VL interface and for methionine in all structural regions except ‘both interfaces’.

Figure 4.

Propensities of amino acids to be mutated during affinity maturation. The ‘propensity to be mutated’ (see 'Experimental procedures') for each amino acid in the various structural regions is shown. Error bars represent the standard error. The structural regions are colored as follows: Ag interface in blue, VH–VL interface in red, both interfaces in green, and ABRs that are not in interfaces in purple.

All polar amino acids show a very distinct preference across these four structural regions. Tyrosine, which is highly important in Ag binding due to its over-representation in Ab ABRs [34], is actually a preferred target for substitution in ABR residues that are not in interfaces and in the VH–VL interface. The only exception is the Ag interface, in which tyrosine is slightly protected from substitutions. Threonine, which has also been suggested to be over-represented in Ag interfaces [31], is mostly neutral to mutation, but is mutated less than expected in the VH–VL interface. Tryptophan is a slightly preferred target for mutation among the residues that are part of both interfaces, and is highly under-mutated in all other regions. Asparagine and glutamine show opposite patterns. While asparagine is mutated more than expected, glutamine is mutated less than expected in both the VH–VL interface and ABRs that are not in any interface. Asparagine also has high mutability in both interfaces, and glutamine is mutated less than expected in the Ag interface. As for the charged amino acids, arginine shows a negative propensity for mutation in the VH–VL interface and in both interfaces. Lysine shows a positive propensity for mutations in ABRs that are not in interfaces. Glutamic acid, aspartic acid and histidine are all less mutated than expected in the Ag interface and in both interfaces.

Five amino acids account for 49% of mutations in the Ag interface region

Figure 5 shows the amino acid composition of the residues that are introduced during SHM. The amino acid composition for SHMs in each structural region was calculated as the percentage of ‘Mutations to AA1’ out of the ‘Mutations in the region’, where ‘Mutations to AA1’ is the number of mutations to a specific amino acid and ‘Mutations in the region’ is the total number of mutations in the structural region. Different factors may affect the frequency of introducing a certain amino acid into the sequence of the Ab, such as codon redundancy, the number of base changes required to introduce a new residue, the frequency of the original codon in germline sequences, the frequency of the amino acid within all protein sequences, the probability of the substitution in general, and even codon usage. As shown in Fig. 5, which presents the raw frequencies of substitutions within each region, there are significant differences for many residues in terms of their propensities to be introduced into the different regions. Figure S2 shows the same frequencies normalized by the number of codons each amino acid has in the genetic code. Interestingly, asparagine, aspartic acid and phenylalanine remain highly favored, and tryptophan and cysteine remain the most disfavored.
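The composition calculation and the codon normalization can be sketched as follows. The codon counts are those of the standard genetic code; the exact normalization used for Fig. S2 is not spelled out in the text, so dividing each percentage by the number of codons is an assumption, as are the function names.

```python
from collections import Counter

# Number of codons per amino acid in the standard genetic code (61 sense codons).
CODON_COUNTS = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4, "H": 2, "I": 3,
    "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6, "T": 4, "W": 1, "Y": 2, "V": 4,
}


def introduced_composition(introduced):
    """Percentage of mutations in a region that introduce each amino acid.

    introduced: iterable of one-letter codes, one per SHM in the region.
    """
    counts = Counter(introduced)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {aa: 100.0 * n / total for aa, n in counts.items()}


def codon_normalized(composition):
    """Divide each percentage by the codon count of the amino acid (assumed scheme)."""
    return {aa: pct / CODON_COUNTS[aa] for aa, pct in composition.items()}
```

Under this scheme, an amino acid encoded by six codons (such as serine or arginine) must be introduced three times as often as one encoded by two codons (such as asparagine) to reach the same normalized value.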

Figure 5.

Amino acid composition of SHMs in the various structural regions. The amino acid composition of newly introduced residues was calculated as described in the text. Error bars represent the standard error. The structural regions are colored as follows: Ag interface in blue, VH–VL interface in red, both interfaces in green, and ABRs that are not in interfaces in purple.

The propensities for substitutions in Fig. 5 show that, in all regions, asparagine is introduced more than glutamine. Aspartic acid is introduced more than glutamic acid in all regions except the VH–VL interface. This may be due to the smaller size of aspartic acid and asparagine compared with glutamic acid and glutamine. Histidine, lysine and proline are introduced into all regions rather moderately. Valine and isoleucine are commonly introduced only in ABR positions that are not in interfaces. Alanine is introduced often into the VH–VL interface and into ABRs that are not in interfaces, but substantially less into the Ag interface. Phenylalanine, glycine, asparagine, arginine, serine and threonine are popular additions to all structural regions. The VH–VL interface, which is made up of two interacting β sheets, is rich in hydrophobic or short polar amino acids (phenylalanine, serine, threonine, alanine, leucine and glycine) that are introduced during the SHM process. When focusing only on the Ag interface, the most frequently introduced residue is asparagine. Other commonly introduced residues in the Ag interface are arginine, serine, threonine and aspartic acid. These five polar amino acids account for 49% of mutations in the Ab–Ag interface. Glycine and phenylalanine are the next most prevalent, probably due to the small size of glycine [35] and the structural similarity between phenylalanine and tyrosine, an amino acid that is highly represented in the germline sequence (37.5% of mutated tyrosines are substituted by phenylalanine).

Mutation probability and energy contribution reveal promising positions for affinity enhancement

Rational design of high-affinity Abs requires targeting of Ab positions for mutations. Our analysis identifies such positions based on in vivo SHM data. Figure 6 shows the probability of mutations for each Ab position (according to IMGT numbering for the V domain [36]) and the mean contribution to binding energy of the SHMs in these positions across all Abs in the dataset. For CDRH3, it is not feasible to identify the germline sequence, as it contains a variable number of residues that originate from the D gene fragment. Thus, the data for this CDR do not include the D regions. SHMs are enriched in the CDRs and their vicinity (see also Fig. S3). This observation is in agreement with previous studies [12, 23] and consistent with the fact that ~ 80% of the Ag-binding residues are within the CDRs [22]. However, an additional region with high mutation probability was found between residues 80 and 87 in the human VH domain (Fig. S3). This is consistent with previous reports on the so-called CDRH4 that was suggested to exist in this area [26, 37]. Positions 80–87 in the VH domain form a loop (Fig. S4) similar to the CDRs, accounting for 1.36% of the human Ab–Ag interactions and 0.3% of the mouse interactions. This is in agreement with the high SHM probability that we observed in this region in human sequences but not in mouse sequences (Fig. S3).

Figure 6.

Mean ∆∆G value and mutation probability for each Ab position. Ab positions are numbered according to the IMGT numbering for the V domain [36]. The Ab positions in the VH domain (A) and the VL domain (B) are indicated on the x axis. The mutation probability is represented by asterisks, and was calculated as the number of mutations in a specific position divided by the number of appearances of any amino acid in this specific position. If, for a given position, the number of appearances of any amino acid was ≤ 5, the position was excluded from the figure. The mean ∆∆G value for each position was calculated from the ∆∆G values for substitution of each SHM in the relevant position back to its germline amino acid. The mean ∆∆G value is represented by gray bars, with error bars indicating standard error. The CDR positions according to IMGT definitions [36] are enclosed in gray boxes.

The regions in the Ab that have high average ∆∆G values for mutating their residues back to the germline amino acids overlap to some extent with regions that have a high mutation probability. However, not all CDR positions undergo substitutions that contribute to binding. For example, CDRH2 (VH positions 56–65) has high mutation probabilities for most of its residues. However, positions 63 and 65 have, on average, no energetic effect on binding despite their high probability for mutations. Positions that are frequently mutated and also show a substantial effect of SHMs on Ag-binding energy, such as 38, 55, 57, 59, 112, 113 and 114 on the VH domain and 110 and 116 on the VL domain, may be promising targets for in vitro affinity enhancement.
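The per-position statistics behind this comparison can be sketched as below. Only the definition of mutation probability (mutations divided by appearances) and the exclusion of positions with five or fewer appearances come from the Fig. 6 legend; the input format, the function names and the selection thresholds are hypothetical.

```python
from collections import defaultdict


def position_summary(observations, max_excluded=5):
    """Per-position mutation probability and mean ddG of the SHMs.

    observations: iterable of (position, mutated, ddg) tuples, one per residue
    per Ab; ddg is only read when mutated is True (hypothetical input format).
    Positions seen max_excluded times or fewer are dropped.
    """
    seen = defaultdict(int)
    ddgs = defaultdict(list)
    for position, mutated, ddg in observations:
        seen[position] += 1
        if mutated:
            ddgs[position].append(ddg)
    summary = {}
    for position, n in seen.items():
        if n <= max_excluded:
            continue
        values = ddgs[position]
        mean_ddg = sum(values) / len(values) if values else None
        summary[position] = (len(values) / n, mean_ddg)
    return summary


def candidate_positions(summary, min_probability, min_mean_ddg):
    """Positions that are both frequently mutated and energetically important."""
    return sorted(
        position
        for position, (probability, mean_ddg) in summary.items()
        if mean_ddg is not None
        and probability >= min_probability
        and mean_ddg >= min_mean_ddg
    )
```

Requiring both criteria is what separates positions such as 63 and 65, which are often mutated but energetically neutral on average, from those proposed above as targets for in vitro affinity enhancement.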


Many of the insights into the structural basis of in vivo affinity maturation were obtained from analyses of SHMs in a single pair, or in several pairs, of germline and mature Abs [15-20, 38]. Large-scale studies that attempted to elucidate the principles that guide SHM reached contradictory conclusions regarding preference for SHMs in the Ab–Ag interface [12, 21, 25, 26]. Our division of the Ab into various structural regions, and the calculation of mutation probability and the energy effects of SHMs in each region, reveal that the highest propensity for SHMs is in Ag-binding regions (Ag interface and both interfaces). These regions also provide the greatest energetic contribution to Ag binding. These results are consistent with the selection of B cells based on Ag binding and with previous studies that showed fine-tuning of the Ag-binding site through SHMs [15, 17]. Although to a lesser extent than the regions involved in Ag binding, ABR residues that are not in the interfaces and residues in the VH–VL interface are both favored targets for mutations and make a substantial energetic contribution to Ag binding. This is consistent with previous studies that showed how internal interface stabilization [38] and increased VH–VL interface shape complementarity [16] result in enhanced Ag binding.

DNA motifs that enhance targeting of the AID enzyme have been the focus of many studies that attempted to identify SHM sites. Such DNA hotspot motifs were previously suggested to play an important role in the formation of SHMs [10]. However, our results indicate that the mature Ab sequence is determined by the affinity and possibly the stability of the Ab. The lack of correlation between the extent to which an amino acid is located within hotspots and its frequency among mutated positions suggests that structural and functional considerations play a much more important role than the presence of AID hotspots.

Our analysis of SHM, germline and general protein–protein interfaces suggested some evolutionary insights. Tyrosine and tryptophan, which are large, flexible, amphipathic amino acids, were previously suggested to be highly represented in the Ag interfaces, and have been proposed to allow binding of several structurally similar Ags [39]. However, the affinity maturation process decreases their representation and increases the representation of aliphatic hydrophobic amino acids. Both SHM contacts and protein–protein contacts are the result of specific evolution and optimization of contacts, while germline-Ag contacts occur between partners that have never met before. This may explain the abundance of germline interface residues that may form several different kinds of contacts, and also the higher similarity between protein–protein interfaces and SHM contacting residues. This observation is consistent with a previous study that suggested that Ab affinity maturation and protein–protein interface evolution are guided by similar principles [40].

The ∆∆G values in this study were predicted by FoldX [29]. While there may be other tools that allow energetic assessment of individual mutations, FoldX enables rapid assessment of a large number of SHMs. An independent assessment has shown that FoldX is particularly good at assessing the energetic effect of mutations to amino acids other than alanine and of mutations of residues located in loops [41]. Previous studies have shown that FoldX may be used to identify trends in the evolution of protein function [42, 43]. Furthermore, it has recently been used for the study of Ab–Ag interactions [22, 34]. The FoldX energy function also includes scoring parameters for the entropic cost of mutation. However, these parameters are calculated based on theoretical data and have been acknowledged to be a crude estimation of the entropy [28]. It has been shown that loss of flexibility in the Ab paratope, and thus a lower entropic cost of the interaction, is an important aspect of Ab affinity maturation [18-20]. Quantification of such effects requires long molecular dynamics simulations or experimental procedures. Such methods are not applicable to a large number of Ab–Ag complexes; thus, the estimation of paratope rigidification is beyond the scope of this study.

The Ab–Ag dataset we used consists of 196 non-redundant Ab–Ag complexes. As more Ab–Ag complexes become available, it will be possible to also apply this approach to Ab–hapten interactions, which is currently not practical, and even to the interfaces with specific Ags such as gp120 or hemagglutinin, to reveal SHM patterns that are unique to that Ag. For example, it has recently been shown that Abs that broadly neutralize HIV are characterized by a remarkably high number of SHMs [44-46] and may also require SHMs in their framework regions [47].

Over recent decades, Abs have become one of the most effective and popular tools in biotechnology and biomedicine [48], and more than 30 Abs and Ab derivatives have been approved for therapeutic use by the US Food and Drug Administration [49]. Therapeutic and diagnostic Abs frequently require engineering to enhance the affinity of Abs raised in immunized animals or selected by library screens. This step is important to expand detection limits, extend dissociation half-life, decrease drug dosage and increase drug efficiency [50]. The structural and biophysical principles identified here may allow more focused in vitro design of Abs with enhanced affinities.

Experimental procedures

Ab–Ag complex dataset construction

3D structure files of 752 Ab–Ag complexes were downloaded from IMGT/3Dstructure-DB [51, 52] (version 4.5.0). Complexes with Abs from human (154 structures) or mouse and chimeric Abs (492 structures) were retained. Mouse and chimeric Abs were grouped as mouse Abs. To identify the light and heavy chains in each complex, we clustered human sequences into two clusters and murine sequences into two clusters, each corresponding to either heavy or light chains, using BlastClust [53]. Complexes that included only one chain and light chain dimers were removed. For redundancy removal, VH and VL sequences of each Ab were concatenated, and BlastClust was used with sequence identity of 97% and coverage of 95%. The Ab–Ag complex that was not engineered or mutated was selected as the representative of each cluster. In cases where there was more than one non-engineered complex, the one with the longest Ag and the best resolution was used. We used the ‘ligand category’ option from IMGT/3Dstructure-DB to identify Ags that are proteins or peptides. All other Ags were removed. One complex (PDB ID 1IGC) in which the sole non-Ab chain was protein G was also excluded from the analysis. In cases where the closest gene in IMGT did not agree with the annotated species, we reviewed the relevant literature, which led to exclusion of 12 complexes from the analysis: six of these cases were humanized Abs, five came from non-naive synthetic libraries and one came from rabbit. Overall, the dataset contained 196 non-redundant Ab–Ag complexes.

Identification of germline precursors and SHMs

IMGT DomainGapAlign [51] was used to identify the related germline gene precursors and identify SHMs. Only variable regions were analyzed. Human and mouse sequences were submitted separately. Default parameters were used. The CDRH3 and CDRL3 alignments were manually reviewed and corrected accordingly. Similar results were obtained when the analysis was repeated after removing junction positions (positions 106–116 for the VH domain and positions 115 and 116 for the VL domain).

Definition of SHM contacting residues, germline contacting residues and protein–protein interfaces

For each complex structure in the protein–protein dataset (fully described previously [34]), the interface of a given chain included all residues in that chain for which at least one heavy atom is within a distance of 6 Å from any of the other chains [54]. The interface residues in all the chains in the protein–protein dataset were grouped as ‘protein–protein interfaces’. For each complex structure in the Ab–Ag dataset, the contacting residues included all residues for which at least one heavy atom is within a distance of 6 Å from the Ag [54]. We have shown in a previous study that using a distance cut-off of 5 Å does not change the overall composition of contacting residues in Ab–Ag interfaces [34]. Contacting residues that were retained throughout Ab affinity maturation were defined as ‘germline contacting residues’. Contacting residues that were modified during Ab affinity maturation were defined as ‘SHM contacting residues’.
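The contact definition can be illustrated with a minimal Python sketch. It is not the code used in this study: the function names, the input layout (tuples of residue identifier and heavy-atom coordinates) and the choice of an inclusive cut-off (≤ 6 Å) are assumptions made for illustration only.

```python
import math

CUTOFF = 6.0  # distance cut-off in Å, as stated in the text


def contacting_residues(ab_atoms, ag_atoms, cutoff=CUTOFF):
    """Return Ab residue ids that have a heavy atom within `cutoff` of the Ag.

    ab_atoms: iterable of (residue_id, (x, y, z)), one entry per Ab heavy atom
    ag_atoms: iterable of (x, y, z), one entry per Ag heavy atom
    (hypothetical input layout; an inclusive cut-off is assumed)
    """
    ag_atoms = list(ag_atoms)
    contacts = set()
    for res_id, xyz in ab_atoms:
        if res_id in contacts:
            continue  # one close heavy atom is enough for the residue
        if any(math.dist(xyz, ag) <= cutoff for ag in ag_atoms):
            contacts.add(res_id)
    return contacts


def split_contacts(contacts, shm_positions):
    """Split contacting residues into (germline, SHM) contacting residues."""
    shm = set(contacts) & set(shm_positions)
    return set(contacts) - shm, shm
```

Residues listed in `shm_positions` are those that differ from the germline; all other contacting residues are treated as germline contacting residues.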

Energy calculation

We performed a computational alanine scan for all contacting residues in the Ab, and assessed the effect of each mutation on Ag binding. To assess SHMs, we mutated each introduced residue back to its germline residue. ΔΔG values were calculated using FoldX [28, 29]. The following steps were performed in both cases, as they differ from each other only in the mutation target (alanine or the corresponding germline residue). First, PDB structures were optimized using the FoldX RepairPDB function. Then each mutation was performed separately using the BuildModel function. This resulted in generation of mutants and their corresponding wild-type structure models. The heavy chain and the light chain of the Ab were grouped together to calculate the energy values of the assembled Ab, and the AnalyzeComplex function was used to calculate the binding ΔG of each model. The ΔΔG value for each mutant was then calculated by subtracting the wild-type ΔG value from the mutant ΔG value.
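The final subtraction step can be sketched in a few lines of Python. The sketch assumes that the binding ΔG values of the wild-type and mutant models have already been obtained and collected into a dictionary; the function names, the dictionary layout and the mutation label format are hypothetical, and no FoldX call is shown.

```python
def ddg(dg_mutant, dg_wild_type):
    """Binding ΔΔG: mutant binding ΔG minus wild-type binding ΔG."""
    return dg_mutant - dg_wild_type


def ddg_table(binding_dg):
    """Compute ΔΔG for each mutation.

    binding_dg: {mutation_label: (dg_wild_type, dg_mutant)}
    (hypothetical layout; values in the units reported by the energy tool)
    """
    return {label: ddg(mut, wt) for label, (wt, mut) in binding_dg.items()}
```

Under this sign convention, a positive ΔΔG for a reversion to the germline residue means that the model with the germline residue binds less favorably than the mature Ab.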

Ab structural division into non-overlapping structural regions

Contact between two residues was defined as at least two heavy atoms (one from each residue) within a distance of 6 Å. The region ‘Ag interface’ comprises all residues that contact the Ag but do not contact residues from the other Ab chain. The region ‘VH–VL interface’ comprises all residues that contact the other Ab chain but not the Ag. The region ‘both interfaces’ comprises Ab residues that contact both the Ag and the other Ab chain. The ABRs were identified using Paratome [30]. Residues in the ABRs that do not contact the Ag or the other Ab chain were grouped as ‘ABRs not in interfaces’.
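Because the regions are non-overlapping, the assignment reduces to a short decision rule. The Python sketch below assumes that three Boolean flags are already known for each residue; the function name and the label ‘other’ for residues outside all four regions are assumptions, not terms used in the analysis.

```python
def classify_region(contacts_ag, contacts_other_chain, in_abr):
    """Assign a residue to one of the non-overlapping structural regions.

    contacts_ag:          residue contacts the Ag
    contacts_other_chain: residue contacts the other Ab chain
    in_abr:               residue lies in an ABR
    """
    if contacts_ag and contacts_other_chain:
        return "both interfaces"
    if contacts_ag:
        return "Ag interface"
    if contacts_other_chain:
        return "VH-VL interface"
    if in_abr:
        return "ABRs not in interfaces"
    return "other"  # hypothetical label for all remaining residues
```

The interface checks come first, so an ABR residue that contacts the Ag is counted in an interface region rather than in ‘ABRs not in interfaces’.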

Amino acids within DNA hotspot motifs

The DNA hotspot motifs were RGYW or WRCY [10], where R indicates a purine base, Y indicates a pyrimidine base, and W indicates an A or T base. For each amino acid, the proportion within hotspot motifs is the number of occasions on which the amino acid appeared within a hotspot motif out of the total appearances of the same amino acid in the germline sequences (V and J segments only) for all Abs in the dataset.
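The motif search and the per-amino-acid proportion can be sketched as follows. This is an illustration only: the function names are hypothetical, and the sketch assumes that an amino acid counts as ‘within a hotspot motif’ when any of the three bases of its codon overlaps a motif, which the text does not specify.

```python
import re
from collections import Counter

# RGYW or WRCY with R = A/G, Y = C/T, W = A/T.
# The lookahead makes overlapping motifs visible to finditer.
HOTSPOT = re.compile(r"(?=([AG]G[CT][AT]|[AT][AG]C[CT]))")


def hotspot_mask(dna):
    """Return a list of booleans marking bases covered by a hotspot motif."""
    mask = [False] * len(dna)
    for match in HOTSPOT.finditer(dna):
        for i in range(match.start(), match.start() + 4):
            mask[i] = True
    return mask


def hotspot_proportion(dna, protein):
    """Proportion of occurrences of each amino acid that overlap a hotspot.

    dna:     in-frame germline coding sequence (upper case)
    protein: its translation, so that len(dna) == 3 * len(protein)
    Assumption: a codon is 'within' a hotspot if any of its bases is covered.
    """
    mask = hotspot_mask(dna)
    total, inside = Counter(), Counter()
    for k, aa in enumerate(protein):
        total[aa] += 1
        if any(mask[3 * k:3 * k + 3]):
            inside[aa] += 1
    return {aa: inside[aa] / total[aa] for aa in total}
```

To obtain dataset-wide proportions, the two counters would be accumulated over all germline sequences before taking the ratio.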

Distance from the nearest hotspot motif

For each amino acid or mutation up to position 105 (according to IMGT numbering) in the V region, the distance from the nearest hotspot motif (RGYW or WRCY) was calculated as described previously [12]. Briefly, the distance was defined as the number of bases between the middle base of the codon and the nearest base of a hotspot motif. A distance of zero indicates that the middle base of the codon is inside a hotspot motif. Since the motifs have four positions, the center nucleotide of a codon is four times more likely to fall somewhere within the motif than to fall at any other specific distance from it. Therefore, the observed number of cases with a distance of zero was divided by four before calculation of distributions. Amino acids or mutations that had two hotspots at exactly the same distance were counted twice for that distance (with opposite signs).
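A minimal Python sketch of this distance measure is given below. It is not the implementation of [12]: the function names are hypothetical, the distance is assumed to be the difference in base index (so an adjacent base has a distance of 1), and negative values are assumed to denote motifs on the 5′ side.

```python
import re
from collections import Counter

# RGYW or WRCY; the lookahead allows overlapping motifs to be found.
HOTSPOT = re.compile(r"(?=([AG]G[CT][AT]|[AT][AG]C[CT]))")


def hotspot_positions(dna):
    """Sorted indices of all bases covered by a hotspot motif."""
    covered = set()
    for match in HOTSPOT.finditer(dna):
        covered.update(range(match.start(), match.start() + 4))
    return sorted(covered)


def signed_distances(dna, codon_index):
    """Signed distance(s) from the middle base of a codon to the nearest
    hotspot base. Two values are returned when hotspots on both sides are
    equally close; an empty list is returned when no hotspot exists.
    """
    center = 3 * codon_index + 1
    hot = hotspot_positions(dna)
    if not hot:
        return []
    best = min(abs(p - center) for p in hot)
    return sorted({p - center for p in hot if abs(p - center) == best})


def distance_histogram(distances):
    """Count distances, dividing the zero-distance count by four."""
    hist = {d: float(n) for d, n in Counter(distances).items()}
    if 0 in hist:
        hist[0] /= 4
    return hist
```

Pooling the output of `signed_distances` over all codons and passing it to `distance_histogram` gives a corrected distribution of the kind described above.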

Amino acid propensity for mutation

The 196 Ab–Ag complexes were divided into three random subsets. The propensity of each amino acid to be mutated in each subset was calculated as:

$$\text{Propensity}(AA_1) = \frac{N_{AA_1 \rightarrow \text{any}} \,/\, \text{mutations in the region}}{f_{AA_1}}$$

where $N_{AA_1 \rightarrow \text{any}}$ is the number of changes from amino acid AA1 in the germline to any amino acid in this particular structural region, $f_{AA_1}$ is the frequency of amino acid AA1 in the germline sequences of this structural region, and ‘mutations in the region’ is the number of mutations in the structural region. Priors of 1 were added. Propensity values from each of the random subsets were averaged and then used for standard error calculation.
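The calculation can be sketched in Python under explicit assumptions: the propensity is taken to be the fraction of mutations in the region that replace AA1, divided by the germline frequency of AA1 in that region, and the prior is added to the count of changes. The exact placement of the prior is not stated in the text, and the function and argument names are hypothetical.

```python
def mutation_propensity(changes_from, germline_counts, prior=1):
    """Propensity of each germline amino acid to be replaced in one region.

    changes_from:    {aa: number of SHMs that replace aa in the region}
    germline_counts: {aa: occurrences of aa in the germline sequences
                      of the region}
    prior:           pseudocount added to the number of changes
                     (assumed placement)
    """
    total_mutations = sum(changes_from.values())
    total_germline = sum(germline_counts.values())
    if total_mutations == 0 or total_germline == 0:
        return {}
    propensity = {}
    for aa, n_germline in germline_counts.items():
        if n_germline == 0:
            continue  # frequency of zero: propensity is undefined
        frequency = n_germline / total_germline
        observed = (changes_from.get(aa, 0) + prior) / total_mutations
        propensity[aa] = observed / frequency
    return propensity
```

A value above 1 indicates an amino acid that is replaced more often than expected from its germline frequency; a value below 1 indicates the opposite.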

Mutation probability and Ab position numbering

Ab positions and CDR definitions are numbered according to IMGT numbering [36]. The mutation probability was calculated as the number of mutations in a specific position divided by the number of appearances of an amino acid in this specific position. If the number of appearances of an amino acid in a specific position was ≤ 5, the position was excluded from Fig. 6.
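As an illustration, the per-position mutation probability and the exclusion rule can be written as a short Python function. The function name and the dictionary-based input are assumptions; the threshold follows the ≤ 5 rule stated above.

```python
def mutation_probability(mutations, appearances, max_excluded=5):
    """Mutation probability per IMGT position.

    mutations:    {position: number of SHMs observed at the position}
    appearances:  {position: number of Abs with an amino acid at the position}
    Positions with `max_excluded` or fewer appearances are left out.
    """
    probability = {}
    for position, n in appearances.items():
        if n <= max_excluded:
            continue
        probability[position] = mutations.get(position, 0) / n
    return probability
```

Positions that are occupied but never mutated receive a probability of zero rather than being dropped.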

Standard error calculation

Standard errors for Figs 1, 4 and 5 were calculated by dividing the 196 Ab–Ag complexes or 210 general protein–protein interfaces into three random subsets. Values from each of the random subsets were averaged and then used for standard error calculation. For Fig. 6, ∆∆G values for each position were averaged and used for standard error calculation.
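The subset-based procedure can be sketched as follows. The sketch assumes that the standard error is the sample standard deviation of the three subset values divided by the square root of the number of subsets; the text does not state the estimator, and the function name, the seed and the way the subsets are drawn are hypothetical.

```python
import random
import statistics


def subset_standard_error(items, statistic, n_subsets=3, seed=0):
    """Mean and standard error of a statistic over random subsets.

    items:     the complexes (or interfaces) to be divided
    statistic: function that maps a list of items to a number
    Assumed estimator: sample SD of subset values / sqrt(n_subsets).
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    # Deal the shuffled items into n_subsets groups of near-equal size.
    subsets = [items[i::n_subsets] for i in range(n_subsets)]
    values = [statistic(subset) for subset in subsets]
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / len(values) ** 0.5
    return mean, sem
```

Any per-subset quantity, such as the propensity of one amino acid or the frequency of a residue type in the interfaces, can be passed as `statistic`.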


We are grateful to Vered Kunik for help with statistical analysis, and to Guy Nimrod and Sharon Fischman (both at Biolojic Design, Tel Aviv, Israel) for useful comments and discussion on the manuscript. This work was supported by grant number 511/10 from the Israeli Science Foundation.