For years, researchers interested in antigen receptor diversity and usage laboriously sequenced individual clones representing only a tiny fraction of the complete repertoire [1, 2]. A general characterization of the repertoire required the use of nonsequencing “analog” technologies, such as immunoscope/spectratyping or hybridization-based methods [3-7]. With the advent of next generation sequencing technologies, it is now possible to sequence millions of receptor clones representing the entire immune repertoire in a routine experiment. The initial applications of such technologies have produced many high profile publications [8-15], at least three start-up companies, and a repository for the avalanche of data that is sure to be generated. As in many areas of biology, the rate-limiting steps now seem to be data management and analysis rather than acquisition.
Compared with the more limited and abstract measurements from analog methods, the rich detail of digital datasets provided by next generation sequencing provides exciting opportunities for immunologists. The leading platforms — 454, Illumina, and Ion Torrent — can provide long enough reads to sequence the CDR3 region of an antigen receptor. All are now available in lower cost benchtop devices as well as expensive high-throughput machines, heralding an era in which next generation sequencers will join flow cytometers as basic laboratory tools for immunologists.
Nevertheless, next generation sequence datasets incorporate substantial sources of error, which must be understood and mitigated. In this issue of the European Journal of Immunology, Bolotin et al. illustrate the extent of these problems by using the three leading technologies in parallel to sequence rearranged T cell receptor (TCR) genes from replicate samples (Fig. 1). Profound discrepancies are revealed, with the three datasets differing significantly in terms of the diversity and relative abundance of clones within the overall repertoire. Using these comparisons, together with an understanding of the pitfalls that complicate each of the methods used, the authors present platform-specific approaches to error correction. The authors note some relative advantages and disadvantages of the methods, summarized in Table 1, with the caveat that error rates, read number and length, speed, cost, and technical difficulty are changing as improvements in these methods are introduced.
| Platform | Advantages | Disadvantages |
|---|---|---|
| 454 | Longest read lengths | Lowest read number, resulting in bottlenecks |
| Illumina | Greatest read number | Highest error rate; shortest read length |
| Ion Torrent | | Frameshift errors; short fragment length requires highly multiplexed PCR, resulting in amplification bias |
Traditional sequencing based on Sanger technology is robust because the huge number of identical template molecules averages out random errors occurring at the single-molecule level, thereby providing a high signal-to-noise ratio. In contrast, the throughput advantages of next generation sequencing come at the price of much higher error rates. These error rates reflect the fact that single-molecule templates are sequenced, requiring additional amplification and detection steps, which must cope with decreased signal-to-noise ratios while balancing throughput and accuracy.
Next generation sequencing technologies differ in terms of template amplification, chemistry, and detection principles, leading to different rates and patterns of error. For example, the chemistry used by 454 and Ion Torrent does not terminate after a single nucleotide addition. If the same nucleotide is present at several adjacent positions, this is detected by an increase in the light or redox signal, but this increased signal is not fully linear with respect to the number of repeated nucleotides. Accordingly, frameshift errors can occur with templates containing homopolymeric stretches of a single nucleotide. Illumina's sequencing chemistry, which results in the addition of a single base per cycle, is not as vulnerable to this particular type of error. However, other idiosyncratic errors are recognized with this platform.
In genomic applications, next generation sequence data that covers the entire template many times can easily be averaged to reduce errors, thereby accommodating a higher error rate. This approach is possible because coding sequences in genomic DNA are usually present in equimolar amounts, are clearly distinct from each other, and can be differentiated by flanking sequences. Unfortunately, these approaches are not suitable for antigen receptor sequencing. First, distinct receptor rearrangements can differ by only a single nucleotide. Consequently, if the system has a high error rate, it becomes difficult to differentiate true biological variability from sequence error. Second, the contiguous sequence information that can be used to validate the existence of similar genomic sequences is not available for individual lymphocyte receptor clones. Finally, genomic DNA may be obtained in large amounts with invariant sequence. By contrast, lymphocyte receptor samples are difficult to generate reproducibly due to individual variations, cell type variations, and the error-prone processes of reverse transcription (if applicable) and amplification.
Compared to standard genome sequencing, studies of RNA virus quasispecies are more analogous to studies of the antigen receptor repertoire. However, because virologists are primarily concerned with sequence diversity (the degree of sequence difference across a viral population, weighted for abundance), system errors have a relatively low impact on their studies. Indeed, virologists often consider sequence variants below an arbitrary prevalence threshold as artifactual and ignore them, because knowing the true error rate is comparatively unimportant. By contrast, immunologists are primarily interested in what is formally termed sequence complexity (the number of different immune receptor clones). This parameter is exquisitely sensitive to small differences in the sequence error rate. Optimal methods are therefore needed to manage sequence errors in antigen receptor repertoire datasets.
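The asymmetry between the two measures can be made concrete with a toy simulation (hypothetical clone counts and a nominal 1% per-read error rate; a sketch, not a model of any particular platform):

```python
import math
import random

def shannon_diversity(counts):
    """Abundance-weighted sequence diversity (Shannon index)."""
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def complexity(counts):
    """Sequence complexity: the number of distinct clones observed."""
    return len(counts)

random.seed(0)

# Hypothetical ground truth: 100 clones sequenced to 1000 reads each.
true_reads = {f"clone_{i}": 1000 for i in range(100)}

# Simulate sequencing with a 1% chance per read of an error that
# creates a spurious "novel" clone.
observed = {}
for clone, n in true_reads.items():
    errors = sum(1 for _ in range(n) if random.random() < 0.01)
    observed[clone] = n - errors
    for e in range(errors):
        observed[f"{clone}_err{e}"] = 1  # each error masquerades as a new clone

print(complexity(true_reads), complexity(observed))
print(round(shannon_diversity(true_reads), 2),
      round(shannon_diversity(observed), 2))
```

In this toy example the apparent clone count inflates roughly tenfold while the abundance-weighted index shifts by only a few percent, which is why an error rate that virologists can safely ignore is ruinous for clone counting.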
Errors in the reverse transcription or PCR processes can create extra clones that did not exist in the original sample, although these are dwarfed by the error rates of current next generation sequencing technologies. Software that interprets the sequence data can greatly distort the analysis. Overzealous clustering algorithms or high abundance thresholds can minimize true diversity in the samples by ignoring real differences between clones. Conversely, interpretations that are too permissive will treat sequencing errors as true variants, thereby increasing apparent diversity.
How can we improve our ability to differentiate true variants from sequencing artifacts, beyond simple abundance thresholds? As a first step, sequences can be filtered for quality as well as for out-of-frame reads. Further, as suggested by Bolotin et al., the more we know about the rates and types of sequence errors obtained by a particular method, the better we can tailor our analysis. There are several approaches to gaining this information. One approach is to analyze a “gold standard” of known, fixed sequence, whereby variants can be attributed unequivocally to sequence error. For example, Nguyen et al. sequenced cells from TCR transgenic mice. Quality score (Q-score) filtering decreased the percentage of erroneous sequences, but failed to eliminate the error completely; sequence-dependent differences in the efficacy of Q-score filtering were also identified. The accuracy with which low-quality sequences are identified and then assigned to high-quality "core" sequences is a potential limitation of this approach. False positive error prediction can lead to underestimates of true diversity if real low-frequency sequence variants are incorrectly filtered. In addition, low-quality sequences can sometimes map to multiple high-quality core sequences, which can compromise the sequence recovery process. It is also notable that sequencing monoclonal templates can, somewhat paradoxically, alter the performance of certain next generation sequencing technologies by interfering with calibration. Nevertheless, the use of known standards and repeated measurements will be critical for understanding the magnitude, distribution, and variability of sequence errors obtained with different methods, which in turn will be critical for modeling appropriate error correction.
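A minimal sketch of the quality-filtering step (hypothetical reads and helper names; Phred scores assumed to be ASCII-33 encoded as in Illumina 1.8+ FASTQ files):

```python
def mean_quality(qual, offset=33):
    """Mean Phred score of a read (ASCII-33 encoded quality string)."""
    return sum(ord(c) - offset for c in qual) / len(qual)

def q_filter(reads, min_mean_q=30):
    """Drop reads whose mean quality falls below the threshold.

    This reduces but does not eliminate erroneous sequences: a read
    whose errors are confined to one or two confidently miscalled
    bases can still pass on its mean.
    """
    return [(seq, q) for seq, q in reads if mean_quality(q) >= min_mean_q]

# Toy reads (hypothetical CDR3 fragments with ASCII-33 quality strings).
reads = [
    ("TGTGCCAGCAGC", "IIIIIIIIIIII"),  # Q40 at every base: kept
    ("TGTGCCAGAAGC", "IIII####IIII"),  # a run of Q2 bases sinks the mean: dropped
]
kept = q_filter(reads)
print(len(kept))
```

The limitation noted in the comment echoes the incomplete error removal reported for Q-score filtering by Nguyen et al.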
Another approach to the measurement of procedural error rate is to quantify changes in the germline-encoded regions of the somatically variable template, such as the outer parts of the V and J segments distinct from the CDR3 core of the TCR, which should remain invariant after VDJ recombination. However, the assumption that genomic mismatches in these regions represent sequence artifacts is complicated by potential unreported allelic variants. Indeed, functionally relevant antigen receptor gene polymorphisms have already been described, and many more will probably be uncovered by future next generation sequencing studies. The relationship between sequence numbers and input cell numbers, which should form an upper bound for the expected number of observed sequences and also predict the expected frequency of rare sequences in the mixture, is another potential constraint that can be used to reconstruct the repertoire from sequence data. This relationship provides a fairer benchmark for the sensitivity of sequencing methods than spiking experiments using single clones. Bayesian probabilistic approaches for sequence clustering have also been developed that do not rely on previously measured estimates of the error rate. Ultimately, however, improvements in the accuracy of the sequencing technologies themselves and the reliability of the base-calling quality information provided will be necessary to make the task of analysis more tractable. For example, paired-end reads of individual clones can improve the accuracy of Illumina sequencing.
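The germline-region idea can be sketched as follows (hypothetical sequences and helper names; a real pipeline must first align reads to the reference and, as noted above, rule out allelic variants before scoring mismatches as errors):

```python
def germline_error_rate(reads, germline, region=slice(0, 20)):
    """Estimate the per-base procedural error rate from a germline-encoded
    segment (e.g. the outer V region) that should be invariant after
    VDJ recombination: mismatches there cannot be true clonal variation.
    """
    ref = germline[region]
    mismatches = bases = 0
    for read in reads:
        seg = read[region]
        bases += len(seg)
        mismatches += sum(1 for a, b in zip(seg, ref) if a != b)
    return mismatches / bases

# Toy data: four reads over a hypothetical 20-base germline V segment.
germline_v = "TGTGCCAGCAGCTTAGCGGG"
reads = [
    "TGTGCCAGCAGCTTAGCGGG",   # perfect match
    "TGTGCCAGCAGCTTAGCGGG",
    "TGTGCCAGCATCTTAGCGGG",   # one substitution (G -> T)
    "TGTGCCAGCAGCTTAGCGGG",
]
rate = germline_error_rate(reads, germline_v)
print(rate)  # 1 mismatch over 80 bases
```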
Errors due to the sequencing technologies need to be seen in the context of many other sources of error in a repertoire sequencing experiment (Table 2). A critical and rarely discussed source of error is caused by bottlenecks in the sampling protocol. Is the sample size large relative to the repertoire diversity being measured? Are the samples being compared of equal size? If not, it can be very difficult to make meaningful comparisons of diversity. A related error stems from sample yield during nucleic acid extraction and reverse transcription (if applicable), and the effect of protocol steps that may dilute or split the sample. Even if a cell is included in the original sample, it will not be detected if its rearranged receptor genes do not serve as a template in the PCR reaction.
**Sources of error that decrease apparent diversity**
- Cell sample inadequate or variable
- Low or variable yield of template gDNA or cDNA molecules per cell
- High abundance threshold for minority variants

**Sources of error that increase apparent diversity**
- Mutations introduced through reverse transcription or PCR reactions

**Approaches to error correction**
- Relate sequence number and frequency to input cell number
- Measure expected error types and rates by testing known samples
- Compare similarity to germline gene-encoded regions
- Filter out poor quality sequences
- Remove out-of-frame sequences
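The sample-size question raised above can be illustrated with a simple rarefaction sketch (hypothetical repertoire; subsampling by resampling): comparing raw clone counts between samples of different depth is misleading, but subsampling the deeper sample to the shallower depth restores comparability.

```python
import random

# Hypothetical repertoire: 200 distinct clones, 10 cells each.
repertoire = {f"c{i}": 10 for i in range(200)}
pool = [clone for clone, n in repertoire.items() for _ in range(n)]

rng = random.Random(1)
deep = rng.sample(pool, 1000)      # one sample sequenced deeply
shallow = rng.sample(pool, 200)    # another sequenced to 1/5 the depth

# Raw richness makes the deeper sample look far more diverse,
# even though both were drawn from the same repertoire...
print(len(set(deep)), len(set(shallow)))

# ...so rarefy the deep sample to the shallow depth for a fair comparison.
rarefied = [len(set(rng.sample(deep, 200))) for _ in range(200)]
print(round(sum(rarefied) / len(rarefied), 1))
```

After rarefaction, the two samples report similar clone counts, as they should given their common origin.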
Creating PCR protocols to generate products that are truly representative of the starting cell population is a major challenge, whether starting with mRNA or DNA. mRNA is more commonly used for two reasons. First, splicing of the TCR constant region at the mRNA level simplifies amplification strategies because all rearranged receptor genes can be captured with a single primer. Second, mRNA is less complex and multiple copies are present in each cell, thereby making it easier to amplify all rearranged receptor genes from any given sample. The drawback is that the ratio of rearranged receptor genes can be skewed if different cells harbor different numbers of mRNA copies, due either to different levels of transcription or decay. Also, if mRNA is used, information about the number of cells represented in the amplification template is lost; a given amount of mRNA may represent a large amount of mRNA from a few cells or a small amount of mRNA from many cells, potentially leading to distortions of repertoire abundance. By contrast, DNA templates have the advantage of not requiring a reverse transcription step, which can affect yield and introduce sequence errors. In addition, one can infer the number of cells represented in the assay because there is only one DNA copy of a rearranged receptor gene per cell. The downside is the lower abundance of DNA templates. Moreover, the lack of a uniform constant region sequence means that highly multiplexed PCR strategies are required. A major finding of Bolotin et al. is that such highly multiplexed PCR strategies are associated with distortions in the relative abundance of TCR Vβ families as compared with less multiplexed amplification strategies and antibody staining. It is also important to recognize that PCR amplification per se is nonlinear, such that molecular bias in favor of high-frequency templates occurs during the exponential phase regardless of primer composition and starting material.
Accordingly, efforts to limit the number of amplification cycles will reduce quantitative distortions as well as error rates.
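A deterministic sketch shows why cycle number matters (hypothetical per-cycle efficiencies standing in for primer bias; real PCR is also stochastic):

```python
def amplify(counts, efficiencies, cycles):
    """Idealized multiplexed PCR: each template family grows by its own
    per-cycle efficiency. Any efficiency gap between primers compounds
    exponentially with cycle number, distorting relative abundances.
    """
    out = dict(counts)
    for _ in range(cycles):
        out = {t: n * (1 + efficiencies[t]) for t, n in out.items()}
    return out

start = {"Vb1": 100.0, "Vb2": 100.0}   # truly equimolar input
eff = {"Vb1": 0.95, "Vb2": 0.80}       # hypothetical primer bias

for cycles in (10, 20, 30):
    final = amplify(start, eff, cycles)
    print(cycles, round(final["Vb1"] / final["Vb2"], 1))
```

With these assumed efficiencies, a truly equimolar pair is distorted about 2-fold after 10 cycles but more than 10-fold after 30, so every avoided cycle pays off.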
As sequencing methods improve in accuracy, throughput, and cost, several challenges will confront the use of next generation antigen receptor repertoire datasets. These include the need to improve and validate statistical analyses that estimate full diversity from the number of clones found in the sample. This can be problematic, particularly if large numbers of rare sequence variants are present. A related problem will be the development and validation of a biologically meaningful summary statistic for comparing repertoire diversity between different samples, especially if the samples are of different sizes.
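The Chao1 estimator, one widely used lower bound for total richness, illustrates why rare variants are so problematic (a sketch, not the estimator used by any study cited here):

```python
def chao1(counts):
    """Chao1 lower-bound estimate of total clone richness.

    The correction term grows with the squared singleton count, and
    singletons are exactly the category where residual sequencing
    errors accumulate, so error correction must precede estimation.
    """
    observed = len(counts)
    f1 = sum(1 for c in counts.values() if c == 1)  # singletons
    f2 = sum(1 for c in counts.values() if c == 2)  # doubletons
    if f2 == 0:
        return observed + f1 * (f1 - 1) / 2  # bias-corrected form
    return observed + f1 * f1 / (2 * f2)

# Toy sample: 6 observed clones, 3 singletons, 1 doubleton.
sample = {"A": 10, "B": 5, "C": 1, "D": 1, "E": 1, "F": 2}
print(chao1(sample))  # 6 + 3*3/(2*1) = 10.5
```

Each spurious singleton added by sequencing error raises the estimate quadratically, which is why complexity estimates are so sensitive to the error rate.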
Another challenge will be to integrate antigen receptor sequence data with cell-based measurements of phenotype and function. This has previously been possible on a small scale through Sanger-based sequencing of rearranged receptor genes from cells with defined characteristics sorted either individually or as bulk populations. In the future, a more comprehensive picture of the whole repertoire may become possible through improvements in cytometry and microfluidics. For example, most T-cell repertoire analyses to date have focused exclusively on the β-chain of the TCR heterodimer, which enables clonotype identification due to allelic exclusion at this locus. Microfluidic or emulsion amplification technologies may facilitate the development of high-throughput approaches to paired sequence analysis of TCR αβ heterodimers. Paired sequences would inform studies of the naïve repertoire, illuminate the rules that govern αβ-chain combination, serve as a check for β-chain sequence error correction and ultimately contribute to our understanding of TCR function. Today's informatics approaches do not permit much biological inference from the sequences themselves, although molecular signatures have recently been identified in antigen-specific repertoires that provide some degree of functional insight [28, 29]. The ultimate challenge will be to develop experimental and/or informatics technology to infer the binding activity of immune receptors from their amino acid sequence. This is likely to be more straightforward for immunoglobulins than TCRs, due to their higher affinity interactions with antigen that are not dependent on MHC presentation.
Ultimately, the importance of errors in antigen receptor datasets, the best metrics to measure these errors, and the choice of methods to correct them all depend on the intended use of the data. The initial clinical use of next generation antigen receptor sequencing will be to assay for the presence of residual lymphoid tumor cells. In this case, sensitivity and specificity for the individual clone is the only consideration. Other uses of this technology, to assess the diversity, generation, selection and stability of antigen receptor repertoires, will be less tolerant of sequencing artifacts. For some applications, relative rather than absolute measurements of receptor diversity may suffice. For other applications, accurate determination of expanded clones rather than complete repertoire coverage may be key, which would potentially reduce the necessary sample size and assay sensitivity. As these assays move into the clinic, it will be important to have fully disclosed protocols with validated measurements of accuracy. The work of Bolotin et al. moves us a step closer to that goal.