Comparative analysis of nanobody sequence and structure data

Abstract Nanobodies are a class of antigen‐binding protein derived from camelids that achieve comparable binding affinities and specificities to classical antibodies, despite comprising only a single 15 kDa variable domain. Their reduced size makes them an exciting target molecule with which we can explore the molecular code that underpins binding specificity—how is such high specificity achieved? Here, we use a novel dataset of 90 nonredundant, protein‐binding nanobodies with antigen‐bound crystal structures to address this question. To provide a baseline for comparison we construct an analogous set of classical antibodies, allowing us to probe how nanobodies achieve high specificity binding with a dramatically reduced sequence space. Our analysis reveals that nanobodies do not diversify their framework region to compensate for the loss of the VL domain. In addition to the previously reported increase in H3 loop length, we find that nanobodies create diversity by drawing their paratope regions from a significantly larger set of aligned sequence positions, and by exhibiting greater structural variation in their H1 and H2 loops.


| INTRODUCTION
The efficacy of an immune system directly reflects the diversity of antigens against which specific, tightly binding B-lymphocyte antigen receptors (BCRs) can be generated. Conventional full-length antibodies (Abs) have become essential tools in biological research, and a foundation of the biopharmaceutical industry, due to their exquisite binding specificity and high affinity to target antigens. The immense diversity of binding specificity is created by sequence variation in two variable domains, the heavy chain (VH) and the light chain (VL). Together, these have been estimated to yield a diversity of at least 10^15 possible BCRs in humans, 1-3 easily exceeding the B-lymphocyte population size in an individual (10^11). 4 Exactly how the enormous potential sequence diversity translates into antigen specificity is not known. It is clear that there is some redundancy: not every sequence-unique VH-VL combination results in a unique binding specificity. However, the number and locations of amino acid mutations required to change binding specificity have proved difficult to predict. 5,6 A potentially more tractable system is provided by the class of heavy-chain antibodies found in camelid species such as camels, llamas and alpacas, for which the light chain is completely absent (Figure 1).
The 15 kDa isolated variable VHH domain, known as a nanobody (Nb), is approximately ten times smaller than a conventional Ab yet retains comparable binding specificity. 7 Nbs exhibit improved stability and are able to bind Ab-inaccessible epitopes in enzyme active sites, viral capsids and G protein-coupled receptors. 8-12 The camelid VHH domain that forms the Nb is homologous to the Ab VH domain and contains three highly variable loops H1, H2, and H3 (Figure 1). These loops form an extended structural interface at one side of the folded protein domain that contributes to the antigen-binding interface, or paratope, which determines the Nb antigen-binding specificity. 7 The potential sequence diversity of these three loops is much smaller than that of the six highly variable loops that largely determine the binding specificity on the Ab VH-VL domain complex. 2,13-15 For both Nbs and Abs, the fundamental challenge is to determine the molecular code that relates amino acid sequence, and in particular  16 In conventional Abs, the paratope lies at the interface of the VH and VL domains and typically contains residues from as many as six distinct hypervariable loop regions. 17 There is also considerable freedom in how the VH and VL domains dock together, allowing the Ab to maximize the diversity of possible antigen-binding surfaces. 13 In contrast, the Nb paratope is entirely contained within the VHH domain (Figure 1C), drastically reducing the space of possible antigen-binding surfaces without apparently affecting the diversity of resulting binding specificities. 10 Indeed, Nbs typically bind their target antigen with affinities comparable to those achieved by classical monoclonal Abs. 2

How can Nbs generate such a diversity of binding specificities despite their small size and single domain architecture? To address this question, we compile a dataset of 90 nonredundant protein-binding Nbs for which Nb-antigen co-crystal X-ray structures are available, allowing us to identify and study a diverse set of Nb paratopes.
Early studies examined sequence and structural diversity in the hypervariable loop regions of small sets of Nbs. 2,10,18-20 The sequence and structure dataset that we compile allows us to examine the Nb paratope in detail. The collated structures span a diverse set of 72 distinct epitopes on antigens that include membrane proteins, viral proteins, enzymes, and intrinsically disordered proteins. To facilitate comparison with classical Abs we collate a comparable dataset containing 90 nonredundant protein-binding Abs for which Ab-antigen co-crystal X-ray structures are available. We use structural alignments to compare sequence and structure variation within the variable loops and framework regions of Nb VHH domains and Ab VH domains. These datasets enable us to provide the first comprehensive description of the sequence and structural variability present in Nb VHH domains, and to examine how this variability is exploited to generate distinct binding specificities.

| Data collection
We built two datasets of co-crystal protein complex structures, each containing a Nb-antigen or Ab-antigen protein complex. Our novel dataset consists of 90 nonredundant protein-binding Nbs for which Nb-antigen co-crystal structures are available in the PDB. 21 To construct this dataset, all crystal structures containing protein-bound Nbs were downloaded from the PDB and filtered using a 100% sequence identity threshold to give a set of sequence-unique Nb structures. No crystal structure resolution threshold was applied, to maximize the number of structures in the dataset. The structural resolutions range between 1.1 and 4.1 Å, with a mean of 2.2 Å. Of the 90 Nbs, 60 derive from Lama glama (llama), 24 from Camelus dromedarius (camel), and 6 from Vicugna pacos (alpaca). The Nbs in the dataset bind to 72 structurally unique epitopes, where a unique epitope is defined as having <50% of epitope residues in common with another epitope in the dataset. The set of 90 Ab-antigen complex structures was chosen at random from those used in a recent study. 22 The Ab dataset contains 90 sequence-distinct VH domains, which bind 75 structurally unique epitopes, with resolutions that range from 1.2 to 2.8 Å, with a mean of 2.3 Å.
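The two filtering steps described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the procedure actually used: exact string equality stands in for the 100% sequence identity threshold, the shared fraction of epitope residues is measured against the smaller of the two epitopes, and epitopes are counted greedily in input order. The function names and the representation of an epitope as a set of residue labels are hypothetical.

```python
from typing import Dict, Hashable, List, Set


def unique_by_sequence(seqs: Dict[str, str]) -> Dict[str, str]:
    """Keep the first structure ID seen for each distinct sequence.

    Assumption: 100% identity is approximated by exact string equality.
    """
    seen = set()
    kept = {}
    for structure_id, seq in seqs.items():
        if seq not in seen:
            seen.add(seq)
            kept[structure_id] = seq
    return kept


def shared_fraction(a: Set[Hashable], b: Set[Hashable]) -> float:
    """Fraction of the smaller epitope's residues also found in the other.

    Assumption: the denominator is the smaller epitope; the source text
    does not specify it.
    """
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def count_unique_epitopes(epitopes: List[Set[Hashable]],
                          threshold: float = 0.5) -> int:
    """Greedily count epitopes sharing < threshold of residues with every
    previously accepted epitope (an assumed, order-dependent scheme)."""
    representatives: List[Set[Hashable]] = []
    for epitope in epitopes:
        if all(shared_fraction(epitope, rep) < threshold
               for rep in representatives):
            representatives.append(epitope)
    return len(representatives)
```

With real data, each epitope would need to be expressed in a common antigen residue numbering before sets can be compared, which this sketch takes for granted.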

| Sequence alignment
Sequences of Nb VHH domains were extracted from the PDB files and aligned using the online tool ANARCI 23 with the AHo alignment scheme, 24  AHo is included in Supporting Information Table S4.

| Sequence logos
All sequence logos were generated by submitting the relevant sequence alignments to the WebLogo server Version 2.8.2. 25 The logos in Figure 3

| Structure alignment and identification of antigen contacting residues
The antigen-bound Nb and Ab structures were then used to build a global structural alignment for each dataset, shown in Figure 4, using the superposition function in MOE 2015.1001. 26 For the Ab dataset, we aligned the part of the molecule consisting of both the VH and VL domains to reflect the complete structural unit involved in molecular recognition and antigen binding, as it is known that the angle between the VH and VL domains can vary significantly. 13 For the Nb dataset, we aligned the structures over the VHH domains. To identify antigen-contacting residues from the two sets of structures, we identify all residues in each co-crystal structure for which the minimum atom distance to the nearest antigen amino acid is <5 Å. Those residues that contact the antigen in >10% of structures are marked on the sequence alignment in Figure 3 in yellow.
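The contact criterion and the 10% frequency threshold can be illustrated with a short NumPy sketch. It assumes that atomic coordinates have already been parsed from the PDB files into arrays and that every residue carries its aligned (for example AHo) position label; the function names are hypothetical, and whether hydrogens or only heavy atoms were included in the original analysis is not stated in the text.

```python
from collections import Counter

import numpy as np


def contact_residues(binder_coords, binder_positions, antigen_coords,
                     cutoff=5.0):
    """Return aligned positions with any atom closer than `cutoff` Å
    to any antigen atom.

    binder_coords:    (N, 3) atom coordinates of the Nb or Ab domain
    binder_positions: length-N sequence, aligned position label per atom
    antigen_coords:   (M, 3) atom coordinates of the antigen
    """
    binder = np.asarray(binder_coords, dtype=float)
    antigen = np.asarray(antigen_coords, dtype=float)
    # All binder-atom to antigen-atom distances, shape (N, M).
    dists = np.linalg.norm(binder[:, None, :] - antigen[None, :, :], axis=-1)
    min_per_atom = dists.min(axis=1)
    return {pos for pos, d in zip(binder_positions, min_per_atom)
            if d < cutoff}


def frequent_positions(contact_sets, min_fraction=0.10):
    """Positions contacting the antigen in more than `min_fraction`
    of the structures."""
    counts = Counter(pos for contacts in contact_sets for pos in contacts)
    n_structures = len(contact_sets)
    return {pos for pos, c in counts.items()
            if c / n_structures > min_fraction}
```

The all-against-all distance matrix is simple but memory-hungry for large complexes; a spatial index such as a k-d tree would be the usual alternative.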
To quantify the structural variability of different sections of the Nb VHH domain and the Ab VH domain we calculated the average  Tables 1 and 2. We first extracted the framework region Cα atomic coordinates from each PDB file, and then computed the optimal structural alignment between every pair of dataset structures using the Procrustes (Kabsch) algorithm. 27 In each case, the pairwise RMSD for each structural subunit

| RESULTS AND DISCUSSION
To investigate and compare the molecular code used by Nbs and Abs to generate diverse binding specificities we identified 90 co-crystal structures for both Nb-antigen and Ab-antigen complexes. The set of 90 Nb complexes contains 72 distinct structural epitopes, while the set of 90 Ab-antigen complexes contains 75 distinct structural epitopes.
Using these data, we analyze Nb and Ab structural variation in the antigen-bound conformation.
These data allow us to address the key question of how Nbs generate such a diverse range of binding specificities despite their reduced sequence length and compact single domain architecture. Much work has shown that the six hypervariable loops shown in Figure 1C are key to determining Ab interaction specificity. 6,22,28,29 Nbs, in contrast, have only three hypervariable loops, reducing the space of possible sequence variants and hence the potential interaction specificities. This could be compensated in a number of ways, for example: (i) increasing the length of the three variable loops, 19,30 (ii) increasing the level of sequence variation outside the variable loops, and (iii) increasing the diversity of amino acids within the loop regions. Here we use our assembled data to address this question.

| Nanobody sequence analysis
In Figure 2 Figure 2A) and in cyan (Ab alignment, Figure 2B). Both alignments display high levels of variability within the three loop regions H1-3. Perhaps surprisingly, Figure 2 shows that overall the Nb framework is much more conserved than that of the Ab VH domains.
It is important to verify that the framework conservation shown in Figure 2A is not merely an artifact of the small set of camelid species that produce Nbs. Species-specific sequence and structural properties of classical Abs have previously been reported. 32  and Camelus dromedarius from the abYsis database. 33 From each single-species alignment we randomly draw subsets of 90 sequences, and plot the resulting species-specific conservation profiles in Supporting Information Figure S4. We find that subsets of 90 Ab VH domains drawn from single species show much greater sequence diversity than even the multi-species Nb alignment shown in Figure 2A. The greater Ab diversity likely reflects the availability of many V gene families (IGHV1, IGHV2, and so forth) that give rise to Abs, compared to the smaller number of V gene segments in IGHVH that are used to produce the majority of Nbs. 7,34 The striking finding revealed by Figure 2 is that the Nb VHH domain framework is significantly more conserved than the Ab VH
FIGURE 5 Sequence and structural analysis of Nb hypervariable loops H1-3. We characterize amino acid usage, loop length distribution, sequence diversity and structural diversity of Nb A, H1 loops, B, H2 loops, and C, H3 loops. Central loop positions with more than 85% aligned gap characters, which were excluded from the 126-position alignment, are highlighted in yellow. Framework "anchors," highlighted in gray, can be used to detect the loop locations in individual sequences 19
FIGURE 6 Sequence and structural analysis of Ab VH domain hypervariable loops H1-3. We characterize amino acid usage, loop length distribution, sequence diversity and structural diversity of Ab A, H1 loops, B, H2 loops, and C, H3 loops. Central loop positions with more than 85% aligned gap characters, which were excluded from the 126-position alignment, are highlighted in yellow. Framework "anchors," shown in gray, are more variable than in the VHH alignment, but can still be used to detect the loop locations in individual sequences
The greater framework conservation in Nbs is surprising, as Nbs might be expected to increase their sequence variation to compensate for the reduced potential sequence diversity due to their small size. and W47F/G are often shielded from solvent by the H3 loop. 7,34,35 Our data support sequence differences reported in previous studies. 19,20 In addition, Figure 3 shows that throughout the FR3 region Nbs make greater use of charged residues. Specifically, D62, K65, R67, R72, K76, and E89 are all more prevalent in Nb VHH domains compared to Ab VH domains, increasing the solubility of the isolated VHH domain.
Our datasets of co-crystal structures allow us to identify residues that lie on the interface with the protein antigen, that is, the paratope (see Methods). These sequence positions play an important role in determining interaction specificity. Overall, the average number of antigen-contacting residues is 18.76 per VHH structure, compared to 16.01 per VH structure and 24.91 per VH-VL unit.
In Figure 3 reported to be longer in Nb VHH domains. 7,10,19,30 The additional paratope positions include multiple framework residues located at the N-terminus, across the FR2 region, and adjacent to the hypervariable loops. Furthermore, Figure 3 shows that paratope positions tend to be less conserved than neighboring nonparatope framework positions, as might be expected if they help determine interaction specificity.

| Structural variation of framework regions
Structural variation of the framework region is another potential source of Nb diversity. To examine this we built structural alignments of Nb VHH domains, Ab VH domains, and Ab VH-VL complexes. Figure 4 shows alignments of the 90 antigen-bound Nb structures (Figure 4A To measure the structural diversity present in these datasets we first extracted the Cα atomic coordinates corresponding to the framework regions. We then used the Procrustes (Kabsch) algorithm to find the linear transformation resulting in the optimal alignment for each pair of crystal structures and built the set of all pairwise alignments for each dataset. 27 This allowed us to compute the matrix of pairwise RMSD values for each dataset: the Nb VHH domains, the Ab VH domains, the Ab VH-VL domain complexes and also the apo form Nb In the case of the Ab VH-VL units we carried out the alignment using two approaches: first, we aligned over all framework Cα atoms from both the VH and VL domains, and second, we aligned over the framework Cα atoms from the VH domain only. The second approach results in a lower RMSD.
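A minimal sketch of the pairwise RMSD calculation is given below, using a standard NumPy formulation of the Kabsch algorithm. It assumes that each structure has already been reduced to an array of framework Cα coordinates with the same number of atoms in the same aligned order (no gaps); the function names are hypothetical and this is not the authors' own code.

```python
import numpy as np


def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate sets after optimal superposition
    (translation plus proper rotation) of P onto Q."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    # Remove translation by centering both sets on their centroids.
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    # Covariance matrix and its singular value decomposition.
    H = Pc.T @ Qc
    U, _, Vt = np.linalg.svd(H)
    # Correct for a possible reflection so that det(R) = +1.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    diff = (R @ Pc.T).T - Qc
    return float(np.sqrt((diff ** 2).sum() / len(P)))


def pairwise_rmsd_matrix(structures):
    """Symmetric matrix of Kabsch RMSD values for a list of structures."""
    n = len(structures)
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            M[i, j] = M[j, i] = kabsch_rmsd(structures[i], structures[j])
    return M


def mean_pairwise_rmsd(structures):
    """Average over all distinct pairs (upper triangle of the matrix)."""
    M = pairwise_rmsd_matrix(structures)
    upper = M[np.triu_indices(len(structures), k=1)]
    return float(upper.mean())
```

Restricting the input arrays to VH framework atoms only, or to VH plus VL framework atoms, would correspond to the two alignment approaches described for the VH-VL units.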

| Nanobody loop analysis
Our analysis so far suggests that any additional mechanism for generating Nb interaction specificity must be contained in the three hypervariable loops H1-3. Indeed, Figure 2A suggests that in contrast to the framework regions, Nb loops are at least as sequence variable as Ab VH domain loops. Furthermore, Figure 4A shows that the greatest structural variability is found in the Nb H3 loop (shown in red). Previous work has found that Nbs possess particularly long H3 loops, 7,10,19,30 which could potentially greatly increase both the sequence and shape diversity of the Nb paratope.  Table 2.
We focus first on the H1 loops.  Figure 6.
Calculation of the average pairwise RMSD of H3 loops reveals the extent of structural variability (see Table 2  range of specificities they provide, in binding to "hidden" epitopes that are otherwise inaccessible to larger classical antibodies. 7,9,40  increases the space of possible sequences by a factor of 20, and potentially the space of interaction specificities. Furthermore, the number of structural conformations the loops can adopt for a given sequence also increases as a function of the loop length. In addition, our dataset shows that Nbs encode around 7% more sequence diversity per amino acid residue than their Ab H3 loop counterparts. Longer H3 loops are thought to enable Nbs to bind to antigens by using fingerlike protrusions that extend into epitope cavities. Notably, though, our set of Nbs also includes H3 loops that are shorter than any in the Ab dataset (Figures 5C and 6C). These findings suggest that Nbs exploit increased diversity in their H3 loops to bind tightly and specifically to the antigens with which they are challenged.
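The factor of 20 quoted above presumably refers to the effect of a single additional loop residue. The arithmetic can be made explicit with a short sketch, under the idealized assumption that every loop position can independently take any of the 20 standard amino acids; real repertoires are far more constrained, so these numbers are upper bounds on sequence space rather than estimates of realized diversity. The function name is hypothetical.

```python
def fold_increase(extra_residues: int, alphabet_size: int = 20) -> int:
    """Multiplicative growth of the theoretical sequence space when a loop
    is lengthened by `extra_residues` unconstrained positions."""
    return alphabet_size ** extra_residues


# One extra residue multiplies the number of possible sequences by 20;
# three or four extra residues multiply it by 8,000 or 160,000.
for k in (1, 3, 4):
    print(k, fold_increase(k))
```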
The third mechanism has not, to our knowledge, been previously reported in the literature. Our co-crystal structure dataset reveals that Nb paratopes are drawn from 50 aligned sequence positions, a significantly broader set than the 35 positions employed by classical Ab VH domain paratopes. This is despite the fact that each Nb uses on average only 2.75 additional antigen-contacting residues compared to Ab VH domains. This suggests a novel mechanism that Nbs exploit to generate diverse binding specificities. The ability to draw paratope residues from a larger set than VH domains will promote diversity of both the shape and the physical properties of the antigen-binding interface, enabling Nbs to use a broader range of their surface to interact with their cognate antigen through different binding modes. Perhaps surprisingly, the aligned Nb sequence positions with high contact propensity are not particularly sequence diverse. Figure 3 highlights positions. It appears that Nbs compensate for the missing VL domain by using a similar number of conserved paratope positions to combined VH-VL units.
In summary, we have found that Nbs do not appear to generate specificity-enabling diversity through increased sequence or structure diversity in the framework. It is particularly surprising that, compared to Abs, Nbs do not exploit increased sequence variation in loops H1 and H2 to compensate for the loss of the VL domain. However, Nbs do exhibit increased structural variation, in particular in the H1 loop. The Nb H3 loop is the only part of the domain with greater sequence diversity than Ab VH domains, and the majority of this increased sequence diversity is achieved through the incorporation of, on average, only 3-4 additional residues. Apparently, the loss of the additional diversity that would be generated by a cognate VL domain is compensated for by a very small insertion, coupled with the freedom to sample more diverse loop conformations. This indicates that the capacity for molecular specificity in a small protein domain is much higher than we might expect based on classical Abs, suggesting that there is exciting potential for generating high-affinity, specific binding to a diverse range of targets using short amino acid sequences that are relatively constrained.