Immunoglobulin gene analysis as a tool for investigating human immune responses

Summary The human immunoglobulin repertoire is a hugely diverse set of sequences that are formed by processes of gene rearrangement, heavy and light chain gene assortment, class switching and somatic hypermutation. Early B cell development produces diverse IgM and IgD B cell receptors on the B cell surface, resulting in a repertoire that can bind many foreign antigens but which has had self‐reactive B cells removed. Later antigen‐dependent development processes adjust the antigen affinity of the receptor by somatic hypermutation. The effector mechanism of the antibody is also adjusted, by switching the class of the antibody from IgM to one of seven other classes depending on the required function. There are many instances in human biology where positive and negative selection forces can act to shape the immunoglobulin repertoire and therefore repertoire analysis can provide useful information on infection control, vaccination efficacy, autoimmune diseases, and cancer. It can also be used to identify antigen‐specific sequences that may be of use in therapeutics. The juxtaposition of lymphocyte development and numerical evaluation of immune repertoires has resulted in the growth of a new sub‐speciality in immunology where immunologists and computer scientists/physicists collaborate to assess immune repertoires and develop models of immune action.


DUNN-WALTERS ET AL.

FIGURE 1 (a) Variable (V), Diversity (D) and Joining (J) gene segments are arranged in a non-functional state in the germline. During V(D)J recombination, a V, a D and a J gene segment (just V and J in the case of light chains) are brought together at random. RSS sequences ensure gene segments are recombined in the correct order to form a functional variable region sequence. Blue, orange and purple rectangles represent V, D, and J gene segments, respectively, with gray leader regions upstream of the V genes. Turquoise and red triangles represent 12RSS and 23RSS, respectively. Constant region exons are represented by green rectangles. (b) Functional variable regions are composed of four conserved structural framework regions (FR) and three more diverse complementarity determining regions (CDR). The CDR3 regions are the most diverse as they span multiple gene segments and contain random nucleotide addition. (c) The CDR loops make the most contact with antigen (PDB ID: 1FVC)

matured in the germinal center, and are therefore able to meet the challenge in force across many different anatomical sites. Resolution of the response after the infection is defeated leaves behind memory cells carrying the effective BCRs in order to provide faster and more efficient protection, with greater affinity, should the same challenge be encountered again. The potential diversity of the naïve immunoglobulin repertoire has been estimated to be in excess of 10^18, which is 10^5 times more than the estimated number of B cells in the body. 3 The enormous diversity facilitated by V(D)J recombination has the disadvantage that some B cells may carry receptors that bind self-epitopes, leading to autoimmune disease, so we need mechanisms of tolerance to remove such cells.
B cell receptors which bind self-antigen in the bone marrow are selected against via receptor editing (where the light chain of the B cell receptor is exchanged for a different light chain in an attempt to avoid self-reactivity) or cell death. B cell receptors which do not bind self-antigen proliferate and are released into the peripheral blood. Autoimmune disease may occur when central tolerance fails to remove autoreactive B cells before they leave the bone marrow. Several autoimmune diseases are associated with defective central tolerance mechanisms, for example, systemic lupus erythematosus (SLE), 4 rheumatoid arthritis (RA) 5 , and type 1 diabetes. 6 Autoimmune disease can also be a result of failed peripheral tolerance mechanisms, where self-reactivity is acquired outside the bone marrow and needs to be removed. The affinity maturation process of adapting to immunological challenge may, in itself, create autoreactive specificities which require removal from the repertoire. 7 In our own work, we have exploited the unique nature of immunoglobulin gene generation and maturation to investigate B cell dissemination and development in humans, especially with regard to how B cell protection diminishes, and autoimmune risk increases, with age. 8 Along this journey, we find that repertoire analysis methods also provide information about intrinsic processes of immunoglobulin diversity generation that may be of benefit in therapeutic antibody design and discovery.

| GENERATION OF B CELL DIVERSITY
Immunoglobulin genes are initially formed by gene rearrangement processes during B cell development in the bone marrow. Upon antigen activation they undergo further diversification by processes of somatic hypermutation and class switching in the periphery. The CDR3 regions are the most variable, as they are encoded by the regions of the immunoglobulin where the different gene segments join together. Since light chain rearrangement involves only V and J regions, the CDR-L3 is less diverse than the CDR-H3, where the heavy chain region involves two different joining sites, between IGHV-IGHD and between IGHD-IGHJ as well as the IGHD genes. Diversity at these joining sites is increased in the CDR3 regions because the processes of gene rearrangement are imprecise, exonucleases may remove nucleotides and nucleotides are randomly added in the process by the enzyme Terminal deoxynucleotidyl Transferase (TdT). Only B cells will have a rearranged immunoglobulin gene and this has been quite an advantage working with limited availability of human tissue, as cell purification prior to any PCR is not necessary. Indeed, Ig gene analysis has been used to establish the presence of B cells in a tissue, for example, the presence of B cells in the human thymus. 12
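The effect of imprecise joining described above can be illustrated with a toy simulation. This is a minimal sketch, not a model of real recombination statistics: the segment sequences, the uniform trimming and N-addition distributions, and the function names are all assumptions made for illustration. The same N-addition settings are used for both chains so that the only difference is one junction (V-J) versus two (V-D and D-J).

```python
import random

def _trim(seg, rng, max_trim, end):
    """Remove 0..max_trim bases from one end, always leaving at least one base."""
    k = rng.randint(0, min(max_trim, len(seg) - 1))
    if end == "3prime":
        return seg[:len(seg) - k]
    return seg[k:]

def _n_addition(rng, max_n):
    """Random untemplated nucleotides, standing in for TdT activity."""
    return "".join(rng.choice("ACGT") for _ in range(rng.randint(0, max_n)))

def rearrange_light(v, j, rng, max_trim=3, max_n=4):
    """One junction: V-J."""
    return (_trim(v, rng, max_trim, "3prime")
            + _n_addition(rng, max_n)
            + _trim(j, rng, max_trim, "5prime"))

def rearrange_heavy(v, d, j, rng, max_trim=3, max_n=4):
    """Two junctions: V-D and D-J, with the D segment trimmed at both ends."""
    d_trimmed = _trim(_trim(d, rng, max_trim, "3prime"), rng, max_trim, "5prime")
    return (_trim(v, rng, max_trim, "3prime")
            + _n_addition(rng, max_n)
            + d_trimmed
            + _n_addition(rng, max_n)
            + _trim(j, rng, max_trim, "5prime"))

def distinct_junctions(n_draws=2000, seed=0):
    """Count distinct junction sequences produced from the same toy segments."""
    rng = random.Random(seed)
    # Hypothetical segment ends, chosen only to make the example concrete.
    v, d, j = "TGTGCGAGA", "GGTATAGCAGCAGCTGGTAC", "ACTACTTTGACTACTGG"
    heavy = {rearrange_heavy(v, d, j, rng) for _ in range(n_draws)}
    light = {rearrange_light(v, j, rng) for _ in range(n_draws)}
    return len(heavy), len(light)

if __name__ == "__main__":
    n_heavy, n_light = distinct_junctions()
    print("distinct heavy junctions:", n_heavy)
    print("distinct light junctions:", n_light)
```

Even with a single fixed V, D and J segment, the two-junction product yields more distinct sequences per draw than the one-junction product, which is the qualitative point made in the text about CDR-H3 versus CDR-L3.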

| Hypermutation
Unlike T cells, B cells can further diversify during an active immune response by somatic hypermutation, 13 a process which requires activation induced cytidine deaminase (AID) 14 and additional help, such as from T follicular helper cell interactions. 15 Somatic hypermutation takes place predominantly in the germinal center of follicles, where a Darwinian process of expansion, mutation and selection occurs, known as affinity maturation. 16,17 Cells acquire just one or two Ig variable region mutations in between rounds of selection 18 and maturing cells exit the process as memory or plasma cells. 19 Hence, when looking at the immunoglobulin gene rearrangements in a sample, the presence of mutations, in comparison to germline sequences, makes it evident that the cell has been activated by antigen. Thus, we could show for the first time that even though the B cells of the splenic marginal zone were not class switched, retaining IgM functionality, they were still antigen-experienced cells as their Ig genes were mutated. 20 In chronic lymphocytic leukemia (CLL) the extent of mutation was investigated to try and understand the etiology of the disease and it was found that there were two different classes of CLL with prognostic significance, those with mutated immunoglobulin genes and those carrying germline immunoglobulin genes. 21 The extent of hypermutation may reflect the ongoing activation of a B cell clone and, in agreement with this, we have found that the mucosal barrier environment, where there is constant immune challenge, holds B cells and plasma cells with highly mutated Ig genes compared to systemic tissues. [22][23][24] The extent of hypermutation has also been used to infer the likely activation pathway of a repertoire, with the assumption being that a T-dependent response would always produce B cells carrying more highly mutated Ig genes than a T-independent response. 
There is some evidence for this since patients with CD40L deficiency, whose B cells are unable to receive traditional T cell help, have fewer mutations in their class switched repertoire than controls. 25 Therefore, a study of the human immune response to Dengue infection, which showed a hypomutated repertoire, led to a model of Dengue immune response involving the T-independent repertoire as well as the T-dependent response. 26 The question of whether an antibody has undergone antigen selection as part of its development has been asked in the context of studies on vaccine development, infectious disease, lymphomas and leukemias, and autoimmune diseases. The initial hypothesis was that the distribution of replacement and silent mutations across the IGHV gene would differ in an antigen-selected gene from the distribution expected if mutation were completely random with no selection pressure: an antigen-selected gene would have more replacement than silent mutations in the CDRs, which encode the antibody binding site, and conversely more silent than replacement mutations in the framework regions, which are needed for antibody structural integrity. 27 Calculations then had to be modified to account for our discovery that, even in the absence of selection, out-of-frame gene rearrangements carried more mutations in CDRs than in framework regions. 28 With the later determination of mutational hotspots, 29,30 which are the result of AID targeting and other DNA repair biases, 31,32 incorporation of targeting data into more complex algorithms enabled improved prediction of whether a repertoire of antibodies has been selected or not. 33 Other nuances, such as positional effects with respect to transcription initiation sites, 34
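The replacement versus silent comparison described in this section starts from a simple tally, which can be sketched as follows. This is an illustrative sketch only: the function name, the region boundaries and the example sequences are assumptions, each point mutation is scored independently against the germline codon, and no targeting (hotspot) model is applied, so it yields observed counts rather than a selection test.

```python
from itertools import product

# Standard genetic code, built in TCAG order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(_BASES, repeat=3), _AMINO)}

def count_rs(germline, observed, regions):
    """Count replacement (R) and silent (S) point mutations per region.

    germline, observed: in-frame, gap-free nucleotide strings of equal length.
    regions: dict mapping a region name to a half-open (start, end) codon range.
    """
    counts = {name: {"R": 0, "S": 0} for name in regions}
    for name, (start, end) in regions.items():
        for codon_idx in range(start, end):
            g = germline[3 * codon_idx: 3 * codon_idx + 3]
            o = observed[3 * codon_idx: 3 * codon_idx + 3]
            for pos in range(3):
                if g[pos] != o[pos]:
                    # Apply this single change to the germline codon.
                    mutated = g[:pos] + o[pos] + g[pos + 1:]
                    silent = CODON_TABLE[mutated] == CODON_TABLE[g]
                    counts[name]["S" if silent else "R"] += 1
    return counts

if __name__ == "__main__":
    germline = "GCTAGCTATGGT"   # Ala Ser Tyr Gly (made-up fragment)
    observed = "GCCAGCTTTGGT"   # GCT>GCC is silent, TAT>TTT is Tyr>Phe
    # Hypothetical boundaries: codons 0-1 framework, codons 2-3 CDR.
    print(count_rs(germline, observed, {"FR": (0, 2), "CDR": (2, 4)}))
```

In practice the region boundaries would come from a numbering scheme applied by an alignment tool, and the observed counts would be compared with expectations that account for the intrinsic CDR bias and hotspot targeting noted above.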

| Class switching
The function of an antibody can be varied by changing its Fc region. In the human, IgG1 and IgG3 have high affinities for Fc receptors on accessory cells, so they can mediate antibody-dependent cell cytotoxicity (ADCC) and help activate the immune system; these subclasses are also good at complement activation. On the other hand, IgG2 and IgG4 are essentially blocking antibodies since they have very low affinity for Fc receptors and no complement activation. It is worth noting that the mouse classes are not equivalent: IgG3, IgG2b and IgG2c have ADCC capability and IgG1 is the blocking subclass. Another difference between human and mouse is in IgA, where humans have two subclasses and mice only one. IgA is a mucosal antibody and can be secreted across barriers in the gut, breast, lungs and GU tract to block pathogens at mucosal surfaces. The major difference between IgA1 and IgA2 lies in the drastically extended hinge region of IgA1, thought to improve antigen recognition by increasing affinity with antigen epitopes that are spatially distant, but making it vulnerable to proteases. [37][38][39] The IgE antibody has received an increasing amount of attention because of its role in hypersensitivity responses and allergy in the developed world, although it was initially thought to have evolved to target parasites (eg, helminths and parasitic arthropods) that are too large to be phagocytosed. [40][41][42] Class switching can be regulated by multiple factors and pathways, both T-dependent and T-independent. As is the case for somatic hypermutation, class switching requires AID, and is most often associated with the germinal center where interaction with T cells via CD40 is critical for the process. Experiments in T cell deficient and CD40 deficient mice have illustrated that germinal center-independent class switching can also occur, provided the correct stimuli are present.
Signaling via Toll Like receptors (TLRs) can complement signaling through the BCR to activate both the non-canonical and canonical NFkB pathways and initiate class switching. 43 Similarly, binding of APRIL or BAFF, produced by accessory cells such as neutrophils, 44 innate lymphoid cells 45 or fibroblasts, 46,47 to TACI on the B cell surface will activate the NFkB pathway via MyD88 to cause expression of AID and class switching. 48 Expression of AID can also be increased by estrogen acting via the HoxC4 AICDA gene activator. 49 The isotype that a B cell will switch to is affected by the environment and signals that the cell receives. In a T-dependent response the cytokines produced by T-helper cells have a critical effect on class switching; IL4 encourages switching to IgG1 and IgE, IL5 and TGFβ encourage switching to IgA, IFNγ encourages IgG3 and IL10 encourages IgG1 and IgG3. There are many other factors which influence the type of class switching. An analysis of the constant region class switch sites in the DNA sequence has revealed many examples of steroid hormone receptor binding sites. Vitamin A helps class switching to IgA and away from IgE, and Vitamin D has also been shown to regulate IgE production. 50 The discovery of potential nuclear receptor binding sites in the regions of DNA that control class switching raises the possibility that class switching could be directly controlled by vitamins and hormones. 51 Metabolites such as prostaglandins can also have an effect, PGE2 acting via STAT6 enhances IL4-mediated class switching to IgE 52 and can increase IgG1 class switch via cAMP. 53 The class of an antibody is determined by the constant region gene that follows the VDJ variable region on the immunoglobulin heavy chain gene. In humans, the genetic order of constant region genes in the genome on Chromosome 14 is μ, δ, γ3, γ1, α1, γ2, γ4, ε, and α2. Multiple consecutive switches between different classes and subtypes may occur. 
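The constant region gene order given above constrains which consecutive switches are possible, because class switch recombination deletes the DNA between the variable region and the newly selected constant gene. A minimal sketch of that constraint follows; the isotype labels and function names are illustrative choices, and the rule encoded here is only the positional one (downstream targets only), not the biological preferences between targets.

```python
# Human IGH constant genes in genomic order (mu, delta, gamma3, gamma1,
# alpha1, gamma2, gamma4, epsilon, alpha2), written as isotype labels.
CONSTANT_GENE_ORDER = [
    "IgM", "IgD", "IgG3", "IgG1", "IgA1", "IgG2", "IgG4", "IgE", "IgA2",
]

def can_switch(from_class, to_class):
    """True if to_class lies downstream of from_class on the chromosome."""
    return (CONSTANT_GENE_ORDER.index(to_class)
            > CONSTANT_GENE_ORDER.index(from_class))

def is_valid_switch_path(path):
    """Check that every consecutive step in a list of isotypes moves downstream."""
    return all(can_switch(a, b) for a, b in zip(path, path[1:]))

if __name__ == "__main__":
    print(can_switch("IgG1", "IgG2"))                     # downstream
    print(can_switch("IgA1", "IgG3"))                     # upstream
    print(is_valid_switch_path(["IgM", "IgG1", "IgG2"]))  # sequential switch
```

A check like this can be used to flag inferred lineages in which an apparent ancestor carries a more downstream isotype than its descendant, which usually points to a clustering or tree-building error.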
Class switching and somatic hypermutation are related: both occur after activation by antigen and both require AID, so class switched antibodies will exhibit hypermutated Ig genes. Since mutations accumulate gradually during a response, the temporal events in the life of an activated B cell clone can be ordered by using the level of somatic hypermutation as a molecular clock. Thus, the prevalence and order of class switching can be estimated by analyzing lineages in high throughput Ig repertoire data. 54,55 The dominant class switching pathway (approximately 85%) is from IgM/D to IgG1 or IgA1, and switching to the downstream classes is usually achieved by sequential events, for example, from IgG1 to IgG2 or IgA1 to IgA2. The "time", in terms of hypermutation accumulation, from one class switched gene to a further downstream one is less than the "time" taken for IgM/D switching in the first place. More closely related cells are more likely to switch to the same class than more distant ones, in vitro as well as in vivo, possibly as a result of an imprinted state being passed on to progenitors. 54
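The molecular clock idea can be sketched in a few lines. This is a deliberately crude proxy and not the lineage-based analysis used in the cited work: members of one clone are simply ordered by mutation count, and each change of isotype along that ordering is reported as a candidate switch event. The function name and the example clone are hypothetical, and ties in mutation count are left in input order.

```python
def candidate_switch_events(clone):
    """Order clone members by mutation count and list isotype changes.

    clone: list of (isotype, mutation_count) tuples for sequences
    already assigned to the same clone.
    """
    ordered = sorted(clone, key=lambda member: member[1])
    events = []
    for (earlier, _), (later, _) in zip(ordered, ordered[1:]):
        if earlier != later:
            events.append((earlier, later))
    return events

if __name__ == "__main__":
    # Hypothetical clone: isotype and number of mutations from germline.
    clone = [("IgG2", 14), ("IgM", 2), ("IgG1", 9), ("IgG1", 11)]
    print(candidate_switch_events(clone))
```

A lineage tree would place each sequence relative to its actual shared mutations rather than a simple count, so this ordering should be read as a first-pass summary only.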

| REPERTOIRE ANALYSIS APPROACHES
Techniques that amplify and sequence the repertoire have been collectively referred to as Rep-Seq. 56 The initiating step in B cell repertoire studies was the identification of a full suite of PCR primers that could amplify all expressed heavy chain variable regions in a consensus PCR. 57 Early Ig repertoire analysis used PCR primers that bound in the Variable and Joining regions of the rearranged Ig genes to prepare the amplicon libraries for sequencing. While this had the advantage of being a robust method, it did not produce data on the antibody class unless the cells had been sorted using surface markers prior to library generation. It also potentially biased the measurements of J region usage and was open to the risk of V region bias due to faulty primers, because the V region primers were a mix of family-specific primers. While these early sequencing technologies were invaluable for the discovery of new cell populations, they often relied on expensive and time-consuming cloning that did not capture the full repertoire, owing to the single channel capabilities of Sanger sequencing. 20,22,29,58 Advances in Rep-Seq in terms of primer design, coupled with next-generation sequencing, enabled the full repertoire to be explored, with the only drawbacks being difficulty amplifying rare heavy chains, PCR and sequencing bias, and amplification of IgG, which is consistently less efficient than other heavy and light chains. A further step forward came with the use of template switch enzymes and 5′ RACE, as has been frequently used in T cell biology. 59 [75][76][77][78] Paired end data can also be limited in ability to distinguish some somatic variants. 79 As such, the Pacific Biosciences (PacBio) RSII system, which offers read lengths of 10 000 bp on average, has become increasingly attractive for specialized applications 80 despite its comparatively poor reads per run and high cost (see Table 1).
The use of barcodes, a string of known nucleotides added to individual samples by using multiple specifically produced primers, allows simple multiplexing on higher cost sequencing platforms but is currently still expensive. We expect PacBio read numbers to continue to improve, as has been the case with the release of the Sequel platform, which offers a ten-fold increase in reads per run over the RSII, while Illumina technology will remain unparalleled in terms of reads per run but is plateauing on read length improvements. The use of one platform over another in the short term will therefore largely depend on what is required by the researcher (see Table 1).
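Sample barcoding as described above implies a demultiplexing step after sequencing. The following is a minimal sketch of that step under simplifying assumptions: barcodes sit at the very start of each read, matching is exact (no error tolerance), and the barcode sequences and sample names are made up for the example.

```python
def demultiplex(reads, barcodes):
    """Assign reads to samples by exact barcode prefix and strip the barcode.

    reads: iterable of nucleotide strings.
    barcodes: dict mapping barcode sequence -> sample name.
    Reads matching no barcode are collected under "unassigned".
    """
    by_sample = {sample: [] for sample in barcodes.values()}
    by_sample["unassigned"] = []
    for read in reads:
        for barcode, sample in barcodes.items():
            if read.startswith(barcode):
                by_sample[sample].append(read[len(barcode):])
                break
        else:
            by_sample["unassigned"].append(read)
    return by_sample

if __name__ == "__main__":
    barcodes = {"ACGTAC": "donor_1", "TGCATG": "donor_2"}  # hypothetical
    reads = ["ACGTACGAGGTGCAG", "TGCATGCAGGTCCAG", "NNNNNNGAGGTGCAG"]
    for sample, seqs in demultiplex(reads, barcodes).items():
        print(sample, seqs)
```

Production pipelines typically allow a small number of mismatches and choose barcode sets with enough pairwise distance to make that safe; exact matching is used here only to keep the logic visible.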
With these advances in Rep-Seq, long read sequencing technologies and with 3′ PCR primers sufficiently far down the constant region, the distinction between subclasses has enabled a full investigation of antibody class in the repertoires. This is important, as we have shown that the repertoire can vary quite substantially by class of antibody. While IgG1 and IgG3 seem to share repertoire characteristics, IgM cells and IgG2 can vary substantially, particularly in younger adults. 81 In older adults, the selection events shaping the repertoire seem to change. 8 In humans, the main variations in IGHV gene usage seem to be in the relative use of IGHV1 and IGHV3 family genes, 81,82 and others have extended this to show plasmocyte differences. 83 We have also shown repertoire differences as B cells progress through bone marrow development and central tolerance. 84 These studies all serve to reinforce the view that repertoire studies should be conducted on sorted cells, be class and subclass-specific, and that the subjects should be age matched as well as possible.
The most recent advances in Rep-Seq have come with the use of single cell technologies which allow the full antibody structure, both the heavy and light chain from a single cell, to be uncovered.
TABLE 2 The costs of running some of the more prominent single-cell technologies. Note that prices are estimates and may vary as a result of different suppliers, exchange rates and prices scalable on quantity purchased. None of these costs include sequencing, see

This cost is based on an 'off the shelf' model although methods exist for self-assembly. For Drop-Seq and Ig pairing by overlap extension we have used Dolomite Bio as our reference. In this case as well, buying the equipment for one method will reduce the equipment purchase price for the other as parts are interchangeable.

c The 10X system uses the same machine for both methods. Note that the system will also perform both scSeq and paired heavy light chain from the same sample for US$65 more and TCR on top of that at an additional US$65.

These technologies often also have the capacity to produce single cell transcriptomic data (scRNA-seq), the estimated prices for some of the more popular methods are included in Table 2 and see also
methods that use scRNA-seq data may also be used to reconstruct the joint heavy/light chain repertoire coupled with the full transcriptome. 92,93 In 2017 10x Genomics produced chemistry kits for their Chromium machine which are capable of producing barcoded libraries for sequencing that can be separately enriched for BCR or TCR data. To date, however, we have not seen any publications that have implemented this. We believe that these new joint heavy-light chain technologies will form the basis of repertoire analysis in the future, as was the case with class and subclass isotyping, because of the additional structural and full variable region data that can be attained.

| CLONALITY ANALYSIS
Given the available genes, and the probabilities of nucleotide excision/addition, the CDR-H3 region of heavy chain gene rearrangements is highly diverse, producing unique sequences at each rearrangement event. There are some rare instances, where the CDR-H3 is very small such that the probabilities weigh in favor of seeing the same CDR-H3 in two different rearrangement events, 94 but in general the CDR-H3 can be used as a fingerprint for a particular B cell and its progeny and one would not expect to see two different B cells with the same CDR-H3 in a small sample unless they were related. Clustering immunoglobulin sequences into "clones" allows studies of B cell relationships between different samples and can facilitate the study of repertoire both as a whole, and also looking at the background diversity without the effects of clonal expansion.
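Using the CDR-H3 as a clonal fingerprint can be sketched as a single linkage grouping of sequences that share a V gene, a J gene and a CDR-H3 length, and whose CDR-H3 nucleotide sequences differ at only a few positions. This is an illustrative sketch, not a recommended pipeline: the 15% mismatch threshold, the function names and the example sequences are assumptions.

```python
from collections import defaultdict

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def cluster_clones(seqs, max_frac=0.15):
    """Single linkage clustering of (v_gene, j_gene, cdr3_nt) records.

    Two records are linked if they share V gene, J gene and CDR3 length and
    their CDR3s differ at no more than max_frac of positions. Returns one
    cluster label per input record (labels are arbitrary integers).
    """
    parent = list(range(len(seqs)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    groups = defaultdict(list)
    for idx, (v, j, cdr3) in enumerate(seqs):
        groups[(v, j, len(cdr3))].append(idx)

    for members in groups.values():
        for pos, a in enumerate(members):
            for b in members[pos + 1:]:
                limit = max_frac * len(seqs[a][2])
                if hamming(seqs[a][2], seqs[b][2]) <= limit:
                    parent[find(a)] = find(b)

    return [find(i) for i in range(len(seqs))]

if __name__ == "__main__":
    seqs = [
        ("IGHV3-23", "IGHJ4", "TGTGCGAGAGATTACTACTG"),
        ("IGHV3-23", "IGHJ4", "TGTGCGAGAGATTACTATTG"),  # 1 mismatch
        ("IGHV3-23", "IGHJ4", "TGTGCGAGGGATTACTATTG"),  # 2 mismatches
        ("IGHV3-23", "IGHJ4", "TGTGCTCGTCCAAAGGGGTG"),  # unrelated
        ("IGHV1-69", "IGHJ4", "TGTGCGAGAGATTACTACTG"),  # different V gene
    ]
    print(cluster_clones(seqs))
```

The all-against-all comparison inside each group is quadratic, so large datasets would need a faster neighbor search; the grouping by V, J and length already removes most comparisons.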

| Dissemination
Matching IGH genes with the same CDR-H3 in different areas of tis- Since hypermutation levels will always confound this analysis it is impossible to get 100% specificity and sensitivity in the clonal allocation, but it is easier to split an incorrectly clustered clone upon closer inspection than it is to know about potential missing sequences.
A recent paper concluded that single linkage hierarchical cluster-

| Clonal expansion
A key factor in assessing the immune response is to identify the ex- More importantly, we would not be able to find information on the class of antibody under investigation. It is our recommendation that Ig gene repertoires be prepared from mRNA isolated from presorted B cells, adding UMIs and using 3′ PCR primers in the constant region that allow later discrimination between antibody subclasses.
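The benefit of adding UMIs can be illustrated with a simple consensus step: reads carrying the same UMI are assumed to derive from one original mRNA molecule, so a per-position majority vote removes most PCR and sequencing errors and collapses amplification duplicates into one count. This is a minimal sketch under the assumption that reads sharing a UMI are already aligned and of similar length; the function name and example reads are hypothetical.

```python
from collections import Counter, defaultdict

def umi_consensus(reads):
    """Build one consensus sequence per UMI by per-position majority vote.

    reads: iterable of (umi, sequence) pairs. Sequences sharing a UMI are
    compared position by position up to the shortest read in the group.
    """
    by_umi = defaultdict(list)
    for umi, seq in reads:
        by_umi[umi].append(seq)

    consensus = {}
    for umi, seqs in by_umi.items():
        length = min(len(s) for s in seqs)
        consensus[umi] = "".join(
            Counter(s[i] for s in seqs).most_common(1)[0][0]
            for i in range(length)
        )
    return consensus

if __name__ == "__main__":
    reads = [
        ("AAC", "ACGT"),
        ("AAC", "ACGT"),
        ("AAC", "ACCT"),  # one read with a sequencing error
        ("GGT", "TTTT"),
    ]
    print(umi_consensus(reads))
```

Counting consensus sequences rather than raw reads also gives a fairer estimate of clone size, since the number of reads per molecule depends on amplification efficiency.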
Given the technological capability of producing monoclonal antibodies for therapeutics there are many instances where we would like to know the sequence(s) for the antibody/antibodies responding to a particular challenge. It has been assumed that a B cell clone that is most expanded in response to challenge would be the most useful in protecting the host from the challenge. Indeed, there are several reports where the predominant clones in a response have been shown to bind the antigen. 103 In mice these experiments have been particularly successful. 104 However, the assumption of largest clone providing best protection may be too simple, and many different immunoglobulin genes can respond to a single challenge. Human stud- may not always be possible, even when temporal data for the response is available. 107 Comparison of predicted sequences from the whole repertoire with sequences obtained after sorting B cells labeled with the specific antigen can help to develop models for in silico prediction of antigen-specific sequences in a repertoire. 111 We do need to bear in mind that a sampling of blood B cells for sequence repertoire is not the same as sampling the antibodies produced in response to challenge. 112 The latter are produced by plasma cells in the bone marrow and the former are more diverse. In addition, we cannot always assume that a large clonal expansion of IgG would indicate best protection. Other classes of antibody have been shown to be important, such as IgM in Ebola, 113 which may be less focussed in their clonal expansion response. In our laboratory, preliminary experiments using ribosome display to capture antigen-specific sequences do find sequences that we see in the whole repertoire, but not in the largest clonal expansions and often are isotypes other than IgG.

| Clonal evolution
Examination of clustered data on an individual clone level can provide information about the evolution of a B cell clone as the Ig genes acquire mutations in the immune response. It is important to know whether an ongoing expansion of cells is just that, expanding exactly the same immunoglobulin gene, or whether there is also ongoing mutation involved-which would imply the involvement of a more complex germinal center reaction and affinity maturation. Determining the relative position of cells from different phenotypical subsets within a lineage tree may also be able to provide information as to the order of lineage relationships.
We have used manually curated lineage trees to show changes in germinal center selection with age, relationships between different types of memory B cells and ongoing diversification in MALT lymphoma. 24,29,114 Transferring these more in-depth analyses to high throughput methods is dependent on the accuracy of sequence information, and there is a sense of reluctance in the field to take clear biological inferences from what may not be the most precise data. HTS methods that incorporate UMIs and that provide multiple reads of the same unique sequence may be able to provide data which would overcome this reluctance, and it may even be possible to correct sequencing data without the aid of UMIs with an appropriate algorithm such as IgReC. 78 In addition, there are computational methods available for the construction of lineage trees. 115,116 We also need to recognize that allelic variants may exist in the population that may not be represented in germline gene databases and therefore some "mutations" from germline may be miscalled. These could potentially skew hypermutation data from different patients, and there are now methods for predicting germline genes by inference from high throughput data which can help overcome this issue. [117][118][119][120] The earliest analyses of antibody lineage trees employed graph theory to extract metrics with respect to the shape of the trees and analyze how these correlated with biological parameters. 24,[121][122][123][124] Later methods are reviewed elsewhere. 125

| GENE USE ANALYSIS
Comparison of the frequency of use of different immunoglobulin genes between different samples is a useful biomarker for biological skewing of the lymphocyte repertoire. Some individual genes have been identified as being associated with human disease. IGHV5-51 is associated with Celiac disease. 128 IGHV4-34 has often been associated with autoimmune disease and chronic lymphocytic leukemia. [129][130][131] It has been shown to bind citrullinated protein antigen in rheumatoid arthritis, 130 but it also has a unique framework 1 region that can bind to human red blood cell antigens I and i when in its germline form; 132 these antigens can therefore be considered to be superantigens. It is one of few antibodies that has an N-glycosylation site in the germline IGHV region, and it has been hypothesized that the potential autoreactive binding potentials can be modified by changing glycosylation in a germinal center reaction. 133
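A gene use comparison between two samples reduces to computing the relative frequency of each gene and taking the difference. The sketch below assumes one gene assignment per clone rather than per read, so that a single large clonal expansion does not dominate the frequencies; the function names and example data are hypothetical, and no statistical test is included.

```python
from collections import Counter

def gene_usage(genes):
    """Relative frequency of each gene in a list of gene assignments."""
    counts = Counter(genes)
    total = sum(counts.values())
    return {gene: n / total for gene, n in counts.items()}

def usage_difference(sample_a, sample_b):
    """Per-gene difference in relative frequency (sample_a minus sample_b)."""
    freq_a, freq_b = gene_usage(sample_a), gene_usage(sample_b)
    all_genes = sorted(set(freq_a) | set(freq_b))
    return {g: freq_a.get(g, 0.0) - freq_b.get(g, 0.0) for g in all_genes}

if __name__ == "__main__":
    # Hypothetical per-clone IGHV assignments for two samples.
    patient = ["IGHV4-34"] * 3 + ["IGHV3-23"]
    control = ["IGHV4-34"] + ["IGHV3-23"] * 3
    for gene, diff in usage_difference(patient, control).items():
        print(gene, round(diff, 2))
```

With multiple donors per group, the per-donor frequencies from `gene_usage` would be the natural input to a group-level comparison.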

| CDR3 CHARACTERISTICS
The question of which part of the antibody is the most important for antigen binding is an interesting one. As mentioned above, the CDR3 region is the most variable part of the antibody by virtue of the contributions from the different genes at the junction and the imprecise nature of the gene rearrangement process. Mice restricted to a single Variable region gene have shown that they are capable of eliciting high affinity responses to various protein and hapten challenges, which is evidence to support the idea that CDRH3 is the most important sequence conferring specificity of the antibody. 144 They did find that their arbitrarily chosen V region did not support binding to T-independent polysaccharide antigens, so there is reason to believe that CDR1 and 2, and perhaps other aspects of the sequence are also important for certain classes of antigen. Other evidence suggests that V gene use makes a significant difference to antigen recognition. Contact residues may not always be part of the CDR 145 and the same CDRH3 on different heavy/light chain backgrounds can take on different structures. 146 As a result of the complexities of protein folding behavior, selection of mutations for affinity may not be directly related to contact residues. 147 We looked for any biases in CDR3 properties between different IGHV family genes in our data. While most IGHV genes did not appear to affect the CDR3, use of IGHV2 family genes showed a skewing in CDR3 properties compared to the rest of the repertoire, indicating IGHV2 has an effect on CDR3 structure that in turn affects antigen binding sufficiently to affect repertoire selection (Figure 2c). That said, IGHV2 family genes are a very small fraction of the repertoire as a whole, so while it is worth bearing in mind when interpreting CDR3 repertoire information it would only be of concern if the IGHV2 component were altered for any reason.
Much work on the effects of changing CDR3 sequence on antibody specificity has been done in mice 148  Similarly, the level of N nucleotide addition in early B cell development is consistent between heavy, kappa and lambda chains within individuals, but differs between individuals. 152 Given the apparent importance of CDR3 size to an antibody response 82,149 and to central tolerance 84,150 these interindividual differences may warrant closer inspection in studies on immune disease, vaccination and infection as they may be biomarkers of response or autoimmunity.
The physicochemical characteristics of the CDR3 are also important, not only from the point of view of how they affect protein folding, and therefore the shape space of the binding site, but with respect to their ability to interact with other molecules. For example, folding of the CDR-H3 can be affected significantly by the presence of pairs of cysteines, which can form disulphide bonds. 147 We found that there is some selection against the use of cysteines in central tolerance; the percentage of sequences without any cysteines increases from 85% to 91% between preB and naïve B cells. Although it is difficult to infer an antibody's specificity based on its amino acid sequence, it has been observed that the CDR-H3 regions of antibodies in the bone marrow are on average longer, and more hydrophobic, than those in the peripheral blood, 84,151,152 indicating that these CDR-H3 characteristics are selected against during central tolerance. The charge at the binding site is also critical; the prevalence of positively charged arginines in the CDR3 has been associated with binding to (negatively charged) DNA in some antibodies and in SLE 153,154 and to phospholipid antigens. 155  genes and the rest of the repertoire (Figure 2c) we found significant changes in KF2:Side chain size, KF5:Double bend preference, KF6:Partial specific volume, and KF7:Flat extended preference.
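Several of the CDR3 properties discussed in this section (length, cysteine content, charge and hydrophobicity) can be computed directly from the amino acid sequence. The sketch below is illustrative only: it uses the Kyte-Doolittle hydropathy scale as a stand-in for hydrophobicity, a simple count of K and R minus D and E for net charge (histidine is ignored), and it does not compute Kidera factors. The function names and example sequences are assumptions.

```python
# Kyte-Doolittle hydropathy values per amino acid.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def cdr3_properties(aa):
    """Simple sequence-derived properties of a CDR3 amino acid string."""
    positive = sum(aa.count(x) for x in "KR")
    negative = sum(aa.count(x) for x in "DE")
    return {
        "length": len(aa),
        "cysteines": aa.count("C"),
        "net_charge": positive - negative,
        "gravy": sum(KYTE_DOOLITTLE[x] for x in aa) / len(aa),
    }

def fraction_without_cysteine(cdr3s):
    """Fraction of CDR3 sequences that contain no cysteine."""
    return sum("C" not in s for s in cdr3s) / len(cdr3s)

if __name__ == "__main__":
    print(cdr3_properties("ARDRR"))
    print(fraction_without_cysteine(["ARDRR", "ARCSGGSCYDY", "AKGGY", "ARDLW"]))
```

Whether the conserved anchor residues that flank the CDR3 are included in the input is a convention that should be fixed before comparing cysteine percentages between datasets.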

| ANTIBODY STRUCTURE
Given the differences in CDR sequence characteristics between antibodies it is easy to see that the information of real relevance to design of effective antibodies lies in the structure encoded by that sequence. The major hurdle to date has been that immunoglobulin repertoires have either been single chain only, or have been too short to have the full sequence of both chains. Assuming that the single cell and long read technologies will be able to correct this in the near future, then the next challenge will be modeling the protein structure. The steps involved in modeling are reviewed in detail elsewhere, 151 and the challenges are mainly with the CDR3 regions for which suitable templates are not always available in the protein data bank (PDB). We have produced some structures for antibodies that are polyreactive, showing that their long CDR-H3 loops appear to project out of the antigen binding site, but the longer the CDR-H3 then the more likely the antibody would have a flexible conformation and this work is still in its preliminary stages. 160 Others have usefully employed modeling techniques to investigate the maturation of anti-HIV and anti-influenza antibodies. 161 The pipeline for our modeling to date involves making multiple models initially and picking the best one before performing multiple simulations of conformation, using tCONCORD to give an ensemble that can be analyzed. 160 Although this rigorous treatment gives us confidence in the predicted structures, it is computationally quite expensive and difficult to apply in high throughput. A recent paper that used the RosettaAntibody ABodyBuilder is the speediest, at 30 seconds per structure, which is around 567 CPU hours per thousand sequences. 151,164,165 In addition to protein folding, the glycosylation status of antibodies is important. 
Not many immunoglobulin genes have N-linked glycosylation sites in their variable regions in germline configuration (IGHV4-34, IGHV1-8, IGHV5-a), but it is possible to gain these sites through somatic hypermutation. 133 High throughput repertoire studies show that some genes are more likely to acquire an N-glycosylation sequon than others, for example, IGHV3-23 and IGHV6-1 133 and sequons are more often found in or near the CDRs where they are more likely to affect antigen binding. 166 While in most instances the lack of glycosylation on selected antibodies would indicate that the glycans block or reduce binding, there are a few instances of N-glycosylation conferring increased antigen specificity. 166,167

| SUMMARY
There are many areas of biology and medicine where the information available from repertoire data can provide valuable insight. With the increasing importance of biologics as therapeutics, repertoire studies also have a valuable place in the discovery and design of antibodies and chimeric antigen receptors. The study of such large numbers of sequences, with all the complexities that they entail, has resulted in an interdisciplinary field that encompasses immunologists, physicists, computational biologists and mathematical modelers as well as providing a substantial collection of methods and tools. The immediate future directions are to encourage order and standards with respect to tools and data repositories, while at the same time improving existing biological and computational methods to address the challenge of producing accurate paired chain repertoires with tractable high scale structural modeling methods.

CONFLICT OF INTEREST
Authors have no conflict of interest.