Next-generation sequencing technology in clinical virology


Corresponding author: M. R. Capobianchi, National Institute of Infectious Diseases INMI, Lazzaro Spallanzani, Via Portuense, 292-00149, Rome, Italy



Recent advances in nucleic acid sequencing technologies, referred to as ‘next-generation’ sequencing (NGS), have produced a true revolution and opened new perspectives for research and diagnostic applications, owing to the high speed and throughput of data generation. So far, NGS has been applied to metagenomics-based strategies for the discovery of novel viruses and the characterization of viral communities. Additional applications include whole viral genome sequencing, detection of viral genome variability, and the study of viral dynamics. These applications are particularly suitable for viruses such as human immunodeficiency virus, hepatitis B virus, and hepatitis C virus, whose error-prone replication machinery, combined with a high replication rate, generates in each infected individual many genetically related viral variants, referred to as quasi-species. The viral quasi-species, in turn, represents the substrate for the selective pressure exerted by the immune system or by antiviral drugs. With traditional approaches, it is difficult to detect and quantify minority genomes present in viral quasi-species that may nevertheless have biological and clinical relevance. NGS provides, for each patient, a dataset of clonal sequences that is some orders of magnitude larger than those obtained with conventional approaches. Hence, NGS is an extremely powerful tool with which to investigate previously inaccessible aspects of viral dynamics, such as the contribution of different viral reservoirs to replicating virus in the course of the natural history of the infection, co-receptor usage in minority viral populations harboured by different cell lineages, the dynamics of development of drug resistance, and the re-emergence of hidden genomes after treatment interruptions. The diagnostic application of NGS is just around the corner.

Next-generation Sequencing: Principles and Methods

Recent advances in nucleic acid sequencing technologies, referred to as ‘next-generation’ sequencing (NGS), have produced a true revolution and opened new perspectives for research and diagnostic applications. One of the hallmark features of NGS technologies is their massive throughput at a modest cost: hundreds of gigabases can now be sequenced in a single run for a few thousand dollars.

The first commercially available NGS system, developed by 454 Life Sciences, appeared in 2005. Since then, in a relatively short time, several NGS technologies have been developed (Table 1). Because the platforms differ in their characteristics, their spectra of application may differ significantly. In particular, the systems can be grouped into high-capacity sequencers, such as the Genome Analyzer, HiSeq sequencers (Illumina, San Diego, CA, USA), the Heliscope platform (Helicos BioSciences Corporation, Cambridge, MA, USA) and 5500 series sequencers, which use SOLiD technology (Applied Biosystems, Carlsbad, CA, USA); and long-read sequencers, such as the Genome Sequencer (GS) FLX or Junior (454 Life Sciences, Roche Diagnostics, Branford, CT, USA), Ion Torrent (Applied Biosystems), and PacBio RS (Pacific Biosciences, Gen-Probe, Menlo Park, CA, USA).

Table 1. Characteristics and principal applications of next-generation sequencing platforms

| Manufacturer | Platform | Maximum throughput per run (Gb) | Maximum sequence length (bp) | Template preparation/sequencing | Principal applications | Error source |
| --- | --- | --- | --- | --- | --- | --- |
| Illumina | HiSeq 2000 | 600 | 100 | Solid capture/bridge amplification in situ/reversible chain terminators | Genome resequencing, quantitative transcriptomics, genotyping, metagenomics | Signal interference among neighbouring clusters, homopolymers, phasing, nucleotide labelling, amplification, low coverage of AT-rich regions |
| Applied Biosystems | SOLiD | 15 | 75 | emPCR/ligation and two-base coding | Genome resequencing, quantitative transcriptomics, genotyping | Signal interference among neighbours, phasing, nucleotide labelling, signal degradation, mixed beads, low coverage of AT-rich regions |
| Applied Biosystems | Ion Torrent PGM | 1 | 200 | emPCR/real-time sequencing with detection of H+ | De novo genome sequencing and resequencing, target resequencing, genotyping, RNA sequencing on low-complexity transcriptome, metagenomics | Homopolymers, amplification |
| Pacific Biosciences/Gen-Probe | PacBio RS | 0.045 | 1200 | Single-molecule/linear amplification, real-time sequencing, fluorescent nucleotides | De novo genome sequencing and target resequencing, non-amplifiable samples, PCR-free | Low intensities |
| Helicos | Heliscope | 35 | 35 | Single-molecule/reversible chain terminators | Direct RNA sequencing, non-amplifiable samples, PCR-free, and unbiased quantitative analyses | Polymerase, molecule loss, low intensities |
| 454 Life Sciences/Roche Diagnostics | GS FLX+ | 0.7 | 1000 | emPCR/pyrosequencing | De novo genome sequencing and resequencing, target resequencing, genotyping, metagenomics | Homopolymers, signal cross-talk interference among neighbours, amplification, mixed beads |
| 454 Life Sciences/Roche Diagnostics | GS Junior | 0.035 | 400 | emPCR/pyrosequencing | Target resequencing (amplicons), genotyping | Homopolymers, signal cross-talk interference among neighbours, amplification, mixed beads |

emPCR, emulsion PCR.

Current NGS methods use a three-step sequencing process: library preparation, DNA capture and enrichment, and sequencing/detection [1]. In library preparation, DNA (or cDNA) fragments of appropriate lengths are prepared, by either breaking long molecules, or by synthetically preparing short molecules (i.e. by PCR or cloning). In the DNA capture and enrichment phase, these short molecules are labelled with primers that are used to capture and physically separate each single short fragment, fixing it onto a solid substrate. Each single molecule acts as a template for clonal amplification (single-molecule template principle). The sequencing phase is based on DNA polymerization combined with detection. These steps occur concomitantly on myriads of clonally amplified fragments. In some of the platforms, the PCR step is not required, and sequencing is performed on single (Helicos) or on linearly amplified (PacBio RS) molecules. The principles for the sequencing/detection phase are different in the various platforms: pyrophosphate release (pyrosequencing) coupled with optical detection of fluorescence, hydrogen ion release coupled with detection of pH variation by a semiconductor, ligation combined with fluorescence detection, and linear amplification coupled with fluorescence detection. The detected signals are transformed into sequences by an integrated browser and elaborated with specific software, to match pre-established quality scores. Bioinformatic tools for the evaluation and elaboration of the sequence data are in continuous development, according to the various applications.

There are many factors involved in the choice of technology, including cost performance, run time, accessibility, type of application, and convenience. More detailed descriptions of the technical features of NGS methods are given in [2-6].

NGS Applications to Virology

NGS applications to virology have been recently reviewed [7-10]. Some of the topics relevant for clinical application will be briefly described.

Virus discovery (metagenomics)

The term metagenomics designates the analysis of all of the nucleic acid present in a given sample. It allows the exploration of entire communities of microorganisms without the need to isolate and culture individual microbial species, and without previous knowledge of their sequences.

This new science is one of the fastest advancing fields in biology, and is extending our comprehension of the diversity, ecology, evolution and functioning of the microbial world, as well as contributing to the emergence of new applications in many different areas. Several large projects have been funded in the USA or EU for the definition of the human microbiome based on metagenomics (e.g. the Human Microbiome Project, funded by the NIH, and Metagenomics of the Human Intestinal Tract, funded by the EU).

The continuous and dynamic development of NGS represents an unprecedented opportunity for metagenomic applications, and is expanding our capacity to analyse microbial communities from a variety of habitats and environments [3, 11]. In this respect, unbiased massively parallel sequencing is the term used to designate NGS metagenomic applications.

Metagenomic applications of NGS include the discovery of novel viruses from clinical samples in human and animal diseases, e.g. a new arenavirus involved in transplant-associated disease clusters [12] and the new Ebola virus Bundibugyo [13], and the identification of a viral aetiology of an outbreak of a disease in honeybees [14]. NGS is also applied for the characterization of the virome, i.e. of the viral community, in the environment [15, 16], in animals [17], and in humans [18-21].

Whole viral genome reconstruction

The reconstruction of full-length viral genomes, even in the case of unknown or poorly characterized viruses, is a common application of NGS, starting either from culture-enriched viral preparations, or directly from clinical samples. The assay design can vary from shotgun metagenomic sequencing of random libraries, to random shotgun sequencing of full genome amplicons, to overlapping amplicon sequencing and assembly.

These approaches have been used to obtain full genome sequencing of pandemic influenza virus [22-26], human immunodeficiency virus (HIV) [27, 28], human herpesviruses [29, 30], and other viruses. Besides the reconstruction of the viral genomes, in most of these studies evaluation of intra-host virus variability has also been achieved, thanks to the high coverage of single-nucleotide positions allowed by massively parallel sequencing (see also next paragraph).
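As a rough illustration of what "high coverage" means here, the average depth per position can be estimated as the number of sequenced bases that map to the virus divided by the genome length. The sketch below is only a back-of-the-envelope aid: the function name, the 10 kb genome length and the assumption that 1% of the run output is viral are hypothetical, whereas the 0.035 Gb throughput is the GS Junior value from Table 1.

```python
def mean_coverage(throughput_bases: float, genome_length: int,
                  viral_fraction: float = 1.0) -> float:
    """Average number of times each genome position is read.

    throughput_bases: total bases produced by the run
    genome_length:    length of the target genome in bases
    viral_fraction:   assumed share of the output that maps to the virus
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    if not 0.0 <= viral_fraction <= 1.0:
        raise ValueError("viral_fraction must lie between 0 and 1")
    return throughput_bases * viral_fraction / genome_length


# Hypothetical scenario: 0.035 Gb run, 10 kb genome, 1% of bases are viral.
print(mean_coverage(0.035e9, 10_000, viral_fraction=0.01))
```

Real coverage is rarely uniform along the genome, so the average should be read as an upper bound on what the least-covered positions achieve.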

Characterization of intra-host variability

RNA viruses, such as hepatitis C virus (HCV) and influenza virus, and reverse transcriptase-dependent viruses, such as hepatitis B virus (HBV) and HIV, show high intra-host variability. This is the result of high replication capacity and low fidelity of the replication enzyme (ranging from 10^-5 to 10^-3 substitutions/position/cycle). So, whereas a single founder virus often dominates in the early stages of infection, during the subsequent phases a great number of mutations emerge spontaneously. These mutations are rapidly lost if they lead to a replication disadvantage; if they are neutral, they may either disappear or progressively accumulate as a result of random drift; and they are rapidly fixed if they confer a selective advantage. The highly variable mixture of closely related genomes within a given host, referred to as quasi-species, allows a viral population to rapidly adapt to dynamic environments and evolve resistance to vaccines and antiviral drugs [31]. NGS has been widely used for the characterization of intra-host variability of influenza virus [23, 32], HCV, HIV and HBV. Most of these studies have been based on ultradeep sequencing (UDS) of PCR-generated amplicons spanning the genomic region(s) of interest. The main studies on HCV, HIV and HBV are reviewed below.
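To give a feel for what these error rates imply, the expected number of substitutions per genome copy is simply the per-position rate multiplied by the genome length. The sketch below additionally assumes that mutations arise independently at each position, so that their number per copy is approximately Poisson distributed; this model, the function names and the 9600-nucleotide genome length (roughly the size of an HCV genome) are illustrative assumptions rather than figures taken from the studies cited.

```python
import math


def expected_mutations(rate_per_site: float, genome_length: int) -> float:
    """Mean number of substitutions introduced per genome copy per cycle."""
    return rate_per_site * genome_length


def prob_at_least(k: int, mean: float) -> float:
    """P(X >= k) for a Poisson-distributed mutation count with the given mean."""
    below_k = sum(math.exp(-mean) * mean ** i / math.factorial(i) for i in range(k))
    return 1.0 - below_k


# Span the range of fidelity quoted in the text for an assumed 9600-nt genome.
for rate in (1e-5, 1e-4, 1e-3):
    mean = expected_mutations(rate, 9600)
    print(rate, mean, prob_at_least(1, mean), prob_at_least(2, mean))
```

Under these assumptions, the upper end of the quoted range corresponds to several substitutions per newly synthesized genome, which is consistent with the rapid diversification described above.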


HCV

Owing to the high replication rate of HCV and to the error-prone nature of RNA-dependent RNA polymerase, many mutations are spontaneously generated every day, and even the existence of two or three mutations in a single genome is highly probable [33]. This phenomenon provides the genetic background for the selection of mutants that are able to escape immunological and therapeutic control. To date, drug-related mutations that confer resistance to direct-acting antiviral agents (DAAs) have been identified at several positions in NS3/4A and in NS5. In the absence of selective pressure (antiviral therapy), HCV variants bearing mutations conferring resistance are generally present at a very low frequency, making detection extremely challenging.

UDS can be used to identify such resistant variants prior to treatment. Studies performed in vivo have shown that multiple mutations pre-exist in the absence of DAA administration, with an intrapatient frequency generally below 1% [34-36]. In another study, the estimate of the intrapatient frequency of minority resistance mutations was slightly higher [32]. From such studies, a dynamic pattern has emerged, whereby resistance mutations frequently appear, but are rapidly lost in the absence of selective pressure, because of reduced fitness. It has been suggested that not only the number but also the nature of the nucleotide changes can contribute to the genetic barrier to the development of resistance [37]. The dynamics of the emergence of DAA resistance mutations under selective pressure exerted by the new macrocyclic NS3/4A protease inhibitor have been thoroughly investigated with an in vitro replicon model, and this has shown the emergence and disappearance of multiple replicon variants in response to the changing selective pressure [38]. The kinetics of development of DAA resistance in vivo have been studied in a human hepatocyte chimeric mouse model, and this has shown the rapid emergence of resistance mutations after the administration of single drugs [39].
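Why depth matters for variants below 1% can be illustrated with a simple sampling argument. Assuming, as a simplification, that each read is an independent draw from the quasi-species and that sequencing errors are ignored, the number of reads carrying a variant follows a binomial distribution. The function name, the depths compared and the minimum of five supporting reads are hypothetical choices for illustration only.

```python
import math


def detection_probability(depth: int, frequency: float, min_reads: int = 1) -> float:
    """Probability of seeing at least `min_reads` variant reads at a position.

    depth:     number of reads (or clones) covering the position
    frequency: true intrapatient frequency of the variant
    """
    if not 0.0 <= frequency <= 1.0:
        raise ValueError("frequency must lie between 0 and 1")
    # Sum the binomial probabilities of observing fewer than min_reads variant reads.
    missed = sum(
        math.comb(depth, i) * frequency ** i * (1.0 - frequency) ** (depth - i)
        for i in range(min_reads)
    )
    return 1.0 - missed


# A 1% variant: 20 clones versus 5000 reads requiring five supporting reads.
print(detection_probability(20, 0.01))
print(detection_probability(5000, 0.01, min_reads=5))
```

The comparison shows why a few dozen clones usually miss a 1% variant, whereas thousands of reads per position make its detection almost certain under this idealized model.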

HCV genome variation has also been studied at sites other than DAA targets, such as the PKR-eIF2alpha phosphorylation homology domain, and this has shown no correlation between mutations at this site and the response to pegylated interferon and ribavirin [40]. A genome-wide characterization of HCV confirmed that E2, containing hypervariable region-1 and hypervariable region-2, is the most variable region within individual patients, and that the response to pegylated interferon + ribavirin is accompanied by a rapid reduction of complexity in this region, which remains unaffected in non-responding patients [35].

Other studies have addressed intra-host HCV evolution during primary infection, and have highlighted the existence of multiple bottlenecks during and shortly after the transmission event [41, 42], and fine-tuning of the effects of immunological pressure in the subsequent within-host evolution [43, 44]. In addition, UDS has been used to trace HCV transmission among injection drug users [45].

HIV antiretroviral drug resistance

HIV is among the most variable human pathogenic viruses. The main NGS applications to HIV variability include the detection of resistance to antiretroviral drugs and the characterization of genome regions involved in tropism.

Two pioneering studies describing the application of NGS to antiretroviral resistance mutations appeared in 2007 [46, 47]. Subsequently, a number of studies appeared on both naïve and antiretroviral therapy-experienced patients. On the whole, NGS was much more sensitive than Sanger sequencing in identifying the patients harbouring mutations for almost all of the classes of antiretroviral drug, roughly doubling the estimates based on population sequencing. This was principally attributable to low-abundance mutations, which were not detectable with population-based sequencing. Overall, the results of such studies indicate that low-abundance resistance-associated mutations are unfavourable prognostic markers for therapy outcome, both in naïve [48-50] and in experienced patients [51-53]. Longitudinal analysis has been performed in patients failing antiretroviral regimens, and has shown de novo emergence and recombination between mutant genotypes [54]. In addition, in drug-naïve patients, reverse transcriptase (RT) mutations that are not considered to be intrinsically relevant for drug resistance, and that are not detectable by standard sequencing, such as T69S and L210M, can act as ‘sentinels’ of the presence of minority resistant variants [55]. Insights into the transmission of resistance-associated mutations have been provided by other studies [56].

HIV tropism

HIV tropism represents a key factor in disease progression and severity. Several findings indicate that variants using the CCR5 co-receptor (R5 variants) are preferentially transmitted and predominate during the early stages of the infection. CXCR4 co-receptor-using (X4) variants generally emerge at later stages, although they can occasionally be found during primary infection, and constitute a marker of a more aggressive clinical course. Recently, a CCR5 antagonist (maraviroc) has been introduced as an antiretroviral treatment option, with the prerequisite that the HIV strain harboured by candidate patients is not X4. This latter condition may be established by phenotypic tropism testing, or by genotypic methods. The latter methods are based on sequencing of the V3 region of the env gene. Ultradeep pyrosequencing (UDPS) of V3 has been widely used for the retrospective investigation of clinical trials based on maraviroc [57-62]. The main results of such studies have shown that UDS is very sensitive in identifying low-abundance X4 variants. The possibility of obtaining quantitative data allows monitoring of the shift of tropism in the viral population during virological failure. A general concept that has emerged from these studies is that minority X4 variants pre-exist and outgrow the R5 variants during treatment in patients in whom maraviroc administration fails without the development of drug resistance mutations. Therefore, such studies have shown that the availability of sensitive and quantitative methods with which to identify rare X4 variants before the start of maraviroc administration is crucial, and is easily achieved with NGS.

NGS application to V3 sequencing has also provided insights into viral dynamics. Studies from our group have shown that this approach may unveil new scenarios in HIV pathogenesis. We captured the circulating HIV virions with antibodies against blood cell lineage markers to obtain viral progeny from different host cells, and applied V3 UDPS to the viral genomes obtained from this sorting procedure. With this approach, we showed that proviral reservoirs in the monocyte lineage and CD4 T-cells are the source of rebounding viruses after therapy interruption [63]. This study also showed that minority X4 variants are more frequently detected in monocytes, and that the viral quasi-species hosted by these cells is more complex than that harboured by T-cells. The X4 provirus archived in monocytes was identified as a putative source of replication-competent virus [64]; however, the frequency of X4 variants did not show a net increase within 6 months from therapy suspension [65].

Furthermore, NGS showed that X4 variants are frequently detected in patients with primary infection, accounting for up to 50% of cases, and that viral diversity correlates directly with the intrapatient frequency of X4 variants. Patients with high diversity and high X4 frequency have a more aggressive clinical presentation, and require early treatment [66]. Typical phylogenetic trees based on NGS V3 sequence data from patients with primary infection who did or did not need early treatment are shown in Fig. 1; polyphyletic X4 variants are observed only in the patient who needed early treatment.

Figure 1.

Phylogenetic trees of plasma-associated and peripheral blood mononuclear cells (PBMC)-associated human immunodeficiency virus (HIV)-1 from two representative patients with acute infection who harboured X4 variants (patient A received highly active antiretroviral therapy within 6 months after seroconversion, and patient B remained free of therapy). The phylogenetic trees indicated that X4 variants of patient A belonged to distinct lineages, whereas, in patient B, X4 variants were monophyletic in the context of a star-like phylogeny. The trees were built with the V3 amino acid sequences obtained from UDPS, with the neighbour-joining method based on p-distance and 1000 bootstrap replicates as in MEGA version 4.0. The sequence correction pipeline was based on the translation of nucleotide sequences and conservation of only the coding ones, as previously described [62, 63]. The sequences were classified as X4 or R5 by use of the position-specific score matrix (PSSM) algorithm, with the X4R5 matrix for subtype B HIV-1.
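The p-distance used for the trees in Fig. 1 is the proportion of aligned positions at which two sequences differ, and the mean of all pairwise p-distances is one common way to summarize intra-host diversity. The sketch below is a minimal illustration, not the MEGA implementation: it assumes pre-aligned, gap-free sequences of equal length, and the function names and toy sequences are hypothetical.

```python
from itertools import combinations


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differences / len(seq_a)


def mean_pairwise_distance(sequences: list[str]) -> float:
    """Average p-distance over all pairs, a simple measure of diversity."""
    pairs = list(combinations(sequences, 2))
    if not pairs:
        raise ValueError("at least two sequences are required")
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


# Toy aligned fragments (invented, not real V3 sequences).
toy_reads = ["AAAA", "AAAT", "AATT"]
print(p_distance(toy_reads[0], toy_reads[1]))
print(mean_pairwise_distance(toy_reads))
```

With deep sequencing data, identical reads are usually collapsed and weighted by their counts before distances are computed; that step is omitted here for brevity.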

In a recent study based on cloning and sequencing, the degree of HIV diversification was measured in patients from the Swiss Cohort prospectively followed since primary infection [67]. During primary infection, the median viral diversity was low (0.39%), but it was >1% in 11% of the patients. Viral diversity increased with time since infection, but no association was observed between increased diversity and plasma HIV-1 RNA load or CD4 numbers. Most participants harboured R5 strains, and only a small percentage of CXCR4-using viruses within a mixed population were detected in this cohort, in line with our findings.


HBV

The application of NGS to the study of HBV resistance was pioneered by our group in 2009 [68]. The presence of minority variants harbouring major drug resistance mutations, observed in this study for the first time in naïve patients, was confirmed by others [69]. Subsequent studies supported the huge power of NGS for investigating the dynamics of drug resistance mutations in the course of treatment, and showed that minority mutations may exist before treatment initiation or switch, and may expand in response to drug-driven selective pressure [70-72].

One of the striking features of the HBV genome is the overlap between the coding regions (open reading frames) for polymerase and surface glycoprotein (hepatitis B surface antigen (HBsAg)). We have observed a number of changes in the HBsAg open reading frame, several of which led to stop codons, whose frequency ranged between 1% and 3% [68]. In fact, mutated virions that are not capable of coding for correct HBsAg are expected to be strongly impaired, and their maintenance in viral quasi-species should be dependent on the concomitant presence of wild-type virions, as supported by a recent report [73].
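The consequence of overlapping reading frames can be made concrete with a toy example: a single nucleotide change is read in two different codon contexts, so it may cause an amino acid substitution in one protein and a stop codon in the other. The sketch below uses the standard genetic code; the seven-nucleotide sequence is invented for illustration and is not an HBV sequence, and the function name is hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq: str, frame: int = 0) -> str:
    """Translate complete codons starting at offset `frame`; '*' marks a stop."""
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


wild_type = "CTGGCAT"   # invented toy sequence
mutant = "CTGACAT"      # single G-to-A change at the fourth nucleotide

for name, seq in (("wild type", wild_type), ("mutant", mutant)):
    # Frame 0 and frame 1 stand in for two overlapping open reading frames.
    print(name, translate(seq, frame=0), translate(seq, frame=1))
```

In this toy case, the same substitution changes one amino acid in the first frame and introduces a stop codon in the second, which is the kind of coupling between the polymerase and HBsAg frames discussed above.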

The driving force for selecting such variations presumably acts differently on RT and HBsAg. Further studies are necessary to clarify this point, and it is expected that NGS will be extremely helpful in this.


The current revolution in microbiology has been primarily driven by advances in technology and, in particular, by the development of parallel sequencing platforms, which have led to a substantial reduction in costs and a substantial increase in throughput and accuracy. Not only does NGS provide knowledge for basic research, but it also affords immediate application benefits, including improved diagnostics, prognostics and therapy monitoring for many viral diseases.

The need to incorporate these technologies into the everyday thinking of microbiologists, and especially those studying the role that microorganisms play in the environment and in disease development, has been driven by the inability to culture the majority of microorganisms from an ecosystem and the logistics of using a myriad of culture conditions to capture those that can be grown in vitro. To this end, the development of culture-independent methods has provided the scientific community with paradigm-shifting revelations and opened up the world of the microorganism to much more intensive investigation.

Sequence-independent methods are becoming more important for the identification of emerging viruses in a public health context. Viral metagenomics is a relatively new technique that has been used increasingly to identify viruses in clinical specimens. In the diagnostic setting, metagenomic approaches could be used for the systematic analysis of samples collected from patients with unexplained illness, especially in the context of outbreaks and epidemics. As mentioned above, the application of high-throughput NGS methods in viral metagenomics can greatly enhance the chance of identifying viruses in clinical samples, including viruses that are too divergent from known viruses to be detected by PCR or microarray techniques.

In addition, the deep investigation of viral quasi-species by NGS is substantially increasing our understanding of the dynamics of viral infections and their interplay with the selective forces acting during transmission bottlenecks, immune and drug pressure, etc.

Moreover, the possibility of using barcoding systems for the simultaneous analysis of multiple samples can substantially reduce the cost of genotype resistance testing, and may increase the throughput for large-scale population studies [47].
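The barcoding idea can be sketched as follows: each sample's amplicons carry a short, sample-specific tag, the samples are pooled and sequenced together, and the reads are afterwards assigned back to samples by their tag. This minimal sketch assumes that the barcode sits at the start of the read, that it must match exactly, and that no barcode is a prefix of another; the barcodes, sample names and function name are hypothetical.

```python
from collections import defaultdict


def demultiplex(reads: list[str], barcodes: dict[str, str]) -> dict[str, list[str]]:
    """Assign pooled reads to samples by their leading barcode.

    barcodes maps each barcode sequence to a sample name. Reads without a
    recognized barcode are collected under the key "unassigned".
    """
    bins: dict[str, list[str]] = defaultdict(list)
    for read in reads:
        for barcode, sample in barcodes.items():
            if read.startswith(barcode):
                # Strip the barcode so only the insert sequence is kept.
                bins[sample].append(read[len(barcode):])
                break
        else:
            bins["unassigned"].append(read)
    return dict(bins)


# Hypothetical barcodes and pooled reads.
sample_barcodes = {"ACGT": "patient_1", "TGCA": "patient_2"}
pooled_reads = ["ACGTGGGG", "TGCACCCC", "NNNNAAAA"]
print(demultiplex(pooled_reads, sample_barcodes))
```

Production pipelines usually tolerate one or two mismatches in the barcode and use barcode sets designed to remain distinguishable despite sequencing errors.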

Therefore, it is highly probable that NGS will also be exploited for clinical application in the analysis of drug resistance for HIV, HCV, and HBV. We can also envisage that fine assessment of HIV tropism by NGS will be useful to obtain prognostic disease markers during early stages of the infection and before starting or switching antiviral treatment.

However, NGS is a relatively recent technique, and several issues must be addressed before it can become fully exploitable at the clinical laboratory level.

In particular, the high cost of the equipment is a major limiting factor. It is likely that the cost will decrease in the future as the technology of deep sequencing develops. Lower-cost systems have already been developed and deployed in advanced clinical laboratories, such as the Junior equipment that will rapidly replace the higher-scale GS platform from Roche.

Other major issues are access to the computational power needed to analyse very large sequence datasets and the need for skilled personnel with proper bioinformatic expertise. These professionals should integrate existing bioinformatics algorithms with high-performance computing solutions. However, although state-of-the-art NGS data analysis tools can provide good precision, continuous improvement is fundamental to solve problems in a number of fields. First, technical errors must be distinguished from real mutations. Indeed, it is known that PCR polymerases typically have error rates of one substitution per 10^5–10^6 bases. Furthermore, in the case of 454 GS-FLX Titanium pyrosequencing, for instance, the mean error rate can be as high as 1% [74].
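One simple way to frame the distinction between technical errors and real mutations is to ask how likely the observed number of variant reads would be if every one of them were a sequencing error. The sketch below computes this binomial tail probability in log space. It assumes a uniform, independent per-read error rate at the position of interest, which is a strong simplification (homopolymer errors, for instance, are position dependent), and it is not the method used by any of the correction tools named in this review; the function names and numbers are hypothetical.

```python
import math


def binom_log_pmf(k: int, n: int, p: float) -> float:
    """Natural log of the binomial probability of k successes in n trials."""
    return (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log1p(-p)
    )


def error_only_p_value(variant_reads: int, depth: int, error_rate: float) -> float:
    """P(at least `variant_reads` variant reads) if all were sequencing errors."""
    if not 0.0 < error_rate < 1.0:
        raise ValueError("error_rate must lie strictly between 0 and 1")
    if not 0 <= variant_reads <= depth:
        raise ValueError("variant_reads must lie between 0 and depth")
    tail = sum(
        math.exp(binom_log_pmf(k, depth, error_rate))
        for k in range(variant_reads, depth + 1)
    )
    return min(1.0, tail)


# Hypothetical position covered by 5000 reads, assumed error rate of 1%.
print(error_only_p_value(50, 5000, 0.01))    # count compatible with error alone
print(error_only_p_value(150, 5000, 0.01))   # count far above the error level
```

A small tail probability suggests that errors alone are an unlikely explanation, but a variant present at a frequency close to the error rate cannot be separated from noise by this test, whatever the depth.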

As deep sequencing generates millions of reads from a given sample, technology-intrinsic errors might be misinterpreted as mutations or polymorphisms. This requires error identification and correction. Many algorithms have been designed to increase the quality of data (PyroNoise clusters, SHORAH, KEC, ET, etc.), but it is still not possible to obtain data free of error. The use of platforms that do not require PCR amplification prior to sequencing (such as Helicos and PacBio RS) may circumvent some of the problems resulting from the PCR steps, such as primer selection and mutations introduced during amplification. However, due to the lack of PCR, these platforms suffer from low signal strength (Table 1); overall, these methods warrant further validation and investigations of their applicability in virology.

Another problem of data mining is that, at present, there are no perfect alignment tools, and a compromise is generally tolerated between alignment accuracy and time of analysis, where appropriate hardware and software characteristics are crucial.

The persistent problem of DNA contamination and the medical significance of the detection of very low levels of viral nucleic acid will also need to be resolved. This will require close collaboration between different areas of expertise, i.e. bioinformatics and virology, for the development of an efficient pipeline of analysis and for validation of the results.

In conclusion, although the application of NGS to virology is in its infancy, we can expect that its development will not only help to shed light on viral pathogenesis, but will also lead to practical applications in clinical diagnostic laboratories.

Equipment simplification, method standardization and the development of user-friendly bioinformatic tools are urgent needs to be met in order to render the new, potent tool represented by NGS more accessible to clinical virology.


We sincerely apologize to all those colleagues whose important work was not cited because of limitations of space. We are deeply grateful to I. Abbate, B. Bartolini, M. Selleri, S. Menzo, D. Vincenti, M. Carmela Solmone and P. Zaccaro for their contribution to the experimental work performed in our laboratory. This work was partially supported by Grant no. 40H54 to M.R. Capobianchi and Grant no. 40H59 to I. Abbate from Istituto Superiore di Sanità (National AIDS Project), by the Italian Ministry of Health (Fondi Ricerca Corrente and Ricerca Finalizzata) and by the European Union Seventh Framework Programme (FP7/2007-2013) under Grant Agreement no. 278433-PREDEMICS.

Transparency Declaration

None of the authors have a commercial or other association that might pose a conflict of interest.