• Emerging infectious diseases;
  • high-throughput sequencing technology;
  • novel influenza A (H1N1) virus;
  • Solexa system;
  • viral detection



Clin Microbiol Infect 2011; 17: 241–244


The detection of emerging infectious diseases has been a continuing concern, especially with the novel influenza A (H1N1) viral pandemic of 2009. In the present study, we validated a ‘second-generation’ parallel sequencing platform for viral detection in swab samples collected during recent influenza virus infections in Beijing. This approach yielded millions of valid reads per sample and resulted in an almost complete spectrum of nucleotide information. Importantly, novel A (H1N1) and seasonal A (H3N2) influenza virus-derived sequences were detected without any prior knowledge or use of viral genetic information, suggesting that this approach could be a valuable tool for diagnosing emerging infectious diseases.

The detection of emerging infectious diseases, such as severe acute respiratory syndrome, highly pathogenic avian influenza H5N1 and novel influenza A (H1N1), has become a continuous public health concern. Thus, the accurate diagnosis of pathogens in samples is becoming increasingly important. Currently, nucleic acid amplification tests (NAATs), such as PCR, nucleic acid sequence-based amplification, loop-mediated isothermal amplification and DNA microarray, are used with greater frequency in preference to traditional culture- or antigen-based diagnostic procedures, primarily as a result of their greater sensitivity. However, almost all diagnostic NAATs require viral genome information in advance, and thus cannot be used to detect or characterize novel or unexpected viral infections. Likewise, these methods are completely ineffective if a viral genome has evolved sufficiently to result in point mutations at primer binding sites [1,2].

New sequencing technologies referred to as ‘second-generation,’ such as 454 (Roche, Mannheim, Germany), Solexa (Illumina, Inc., San Diego, CA, USA) and SOLiD (Applied Biosystems, Foster City, CA, USA), show promise for the unbiased detection of pathogens. These systems allow researchers to obtain millions of sequences in a single round of operation, as well as an almost complete spectrum of nucleotide information through comparison and analysis with relevant databases. Because the 454 system has previously identified viral pathogens in nasal and faecal specimens [3], we tested the ability of Solexa, another unbiased high-throughput sequencing platform, to detect viral infections in clinical specimens without using any viral genetic information in advance.

In the present study, four patient swab samples were obtained during novel influenza A (H1N1) outbreaks (samples 1, 2 and 3) and seasonal influenza virus A (H3N2) infections (sample 4) in May 2009 from the Laboratory of Beijing CDC, China. One swab sample from a healthy individual (sample 5) was used as a negative control.

Total RNA was isolated from each sample using the QIAamp RNeasy minikit (Qiagen, Hilden, Germany). cDNA was synthesized using the M-MLV RTase cDNA synthesis kit (Takara Bio Inc., Otsu, Japan), and fragmented by nebulizers (Illumina, Inc.) to <800 bp. The overhangs resulting from fragmentation were converted into blunt ends using T4 DNA polymerase and Escherichia coli polymerase I Klenow fragment. After an ‘A’ base was added to the 3′ end, the adapters were ligated to the ends of the DNA fragments. Unligated adapters were then removed, after which the DNA library was obtained by PCR (primer F: 5′-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT-3′; primer R: 5′-CAAGCAGAAGACGGCATACGAGCTCTTCCGATCT-3′). Sequencing and sequence analysis were performed using the SBS sequencing kit v3 and the Genome Analyzer, respectively (Illumina, Inc.).

The quality of the sequencing reads from each sample was monitored by sequencing in-house samples in parallel. Quality control also included the removal of reads containing unrecognized sites or sequencing adaptors, as well as duplicate reads. Valid reads were screened for host contaminants against the human reference genome using ELAND software, a part of the Solexa analysis pipeline [Cox AJ, unpublished data]. The remaining reads were searched against the microbes subset (viruses, bacteria, fungi and protozoa) of the RefSeq database [4] for taxonomic classification using MAQ software [5].
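The read-cleaning step described above can be sketched as follows. This is a minimal illustration only, not the pipeline used in the study: the function name `filter_reads` is hypothetical, and the adaptor string is an assumption (it is the reverse complement of the 3′ end shared by the two PCR primers listed above; the text does not state which adaptor sequence was actually screened for). Host screening with ELAND and classification with MAQ are separate tools and are not reproduced here.

```python
from typing import Iterable, List

# Assumed adaptor fragment, for illustration only: the reverse complement of
# the shared 3' end (GCTCTTCCGATCT) of the two PCR primers given in the text.
ADAPTOR = "AGATCGGAAGAGC"


def filter_reads(reads: Iterable[str], adaptor: str = ADAPTOR) -> List[str]:
    """Return 'valid' reads: no uncalled bases, no adaptor, no exact duplicates."""
    seen = set()
    valid = []
    for read in reads:
        seq = read.strip().upper()
        if not seq:
            continue  # skip empty lines
        if "N" in seq:
            continue  # read contains an unrecognized (uncalled) site
        if adaptor in seq:
            continue  # read contains adaptor sequence
        if seq in seen:
            continue  # exact duplicate of a read already kept
        seen.add(seq)
        valid.append(seq)
    return valid


if __name__ == "__main__":
    example = ["ACGT", "ACNT", "acgt", "TTAGATCGGAAGAGCAA", "GGCC"]
    print(filter_reads(example))
```

The sketch keeps the first copy of each duplicated read and preserves input order; a production pipeline would typically also use base-quality scores, which are not discussed in the text.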

A summary of the sequencing data obtained using the Solexa system is shown in Fig. 1. The proportion of total valid reads that matched the RefSeq database was 5–17% for each sample. As expected, the proportion of reads that matched human mRNA was in the range 39–56% because total RNA, and not just pathogen RNA, was isolated from the swab samples. The remaining 19–44% of reads matched neither host mRNA nor the RefSeq database, most likely because the RefSeq database mainly focuses on representative strains of taxonomically established organisms.


Figure 1.  Summary of sequencing reads using the Solexa system. Valid reads were used for blastn search analyses after removing duplicate reads and reads with unrecognized sites or sequencing adaptors.


Sequences corresponding to hundreds of microbes in the swab samples were characterized directly from the sequencing reads matched with the RefSeq database. The fully characterized microbes are shown in Table 1. Bacteria accounted for the largest proportion, at 63.48–89.65% of total valid reads per sample. The proportions of fungi, protozoa and viruses were 2.14–14.34%, 3.57–31.71% and 0.13–0.53%, respectively.

Table 1. Summary of microbe sequence analysis in swab samples

Sample number               1                  2                  3                  4                  5
Total reads                 298 249            331 249            580 424            176 489            158 327
Fungi                       42 779 (14.34%)    21 572 (6.51%)     58 245 (10.04%)    7655 (4.34%)       3396 (2.14%)
Bacteria                    239 804 (80.41%)   296 948 (89.65%)   485 943 (83.72%)   112 037 (63.48%)   131 617 (83.13%)
Protozoa                    14 733 (4.94%)     11 820 (3.57%)     35 489 (6.11%)     55 970 (31.71%)    22 476 (14.20%)
Virus                       933 (0.31%)        909 (0.27%)        747 (0.13%)        827 (0.47%)        838 (0.53%)
Influenza A viruses         44                 32                 102                254                0
  Novel A (H1N1)            44                 32                 102                0                  0
  Seasonal A (H3N2)         0                  0                  0                  254                0
Murine leukaemia virus      32                 28                 23                 37                 36
Mapping to influenza A virus
  Genome cover rate (%)     10.39              8.39               17.79              44.56              0
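As a quick consistency check on Table 1, the percentages can be recomputed from the read counts. The snippet below is only an illustrative sketch: the dictionary layout and the function name `group_percentages` are hypothetical, while the counts are copied from the table. Recomputed values may differ from the printed ones in the last decimal place because of rounding.

```python
# Read counts per sample, copied from Table 1 (the layout is illustrative).
TABLE_1 = {
    1: {"total": 298249, "fungi": 42779, "bacteria": 239804, "protozoa": 14733, "virus": 933},
    2: {"total": 331249, "fungi": 21572, "bacteria": 296948, "protozoa": 11820, "virus": 909},
    3: {"total": 580424, "fungi": 58245, "bacteria": 485943, "protozoa": 35489, "virus": 747},
    4: {"total": 176489, "fungi": 7655, "bacteria": 112037, "protozoa": 55970, "virus": 827},
    5: {"total": 158327, "fungi": 3396, "bacteria": 131617, "protozoa": 22476, "virus": 838},
}


def group_percentages(counts: dict) -> dict:
    """Percentage of a sample's total reads assigned to each microbe group."""
    total = counts["total"]
    return {
        group: round(100.0 * n / total, 2)
        for group, n in counts.items()
        if group != "total"
    }


if __name__ == "__main__":
    for sample, counts in TABLE_1.items():
        # The four groups should account for every read in the sample.
        assigned = sum(n for g, n in counts.items() if g != "total")
        print(sample, assigned == counts["total"], group_percentages(counts))
```

In each sample the four group counts add up exactly to the total, which indicates that the percentages in Table 1 are expressed relative to the reads classified against RefSeq.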

Sequence analysis of all identified viral reads was performed to characterize any additional sequences. As shown in Table 1, a BLAST search indicated that 44 (sample 1), 32 (sample 2) and 102 (sample 3) reads were novel A (H1N1)-derived, strongly indicating that these three patients were infected with novel A (H1N1) virus, consistent with previous RT-PCR diagnostic results (GenBank accession numbers: GQ183617–GQ183624) [6]. Similarly, 254 reads of sample 4 were derived from seasonal A (H3N2) virus according to BLAST searches. Influenza virus was not detected in the control sample (sample 5). In addition, several common human virus-derived sequences, including adenovirus and herpesvirus, were detected in all samples [7]. Sequences corresponding to murine leukaemia virus were also detected, probably as a result of the addition of M-MLV reverse transcriptase during sample preparation.

Nucleic acid sequence-based methods have been used extensively for the diagnosis of viral infection in recent years. However, most methods depend on prior knowledge of pathogen sequences, rendering them ineffective for detecting unexpected or mutated viral sequences. This limitation may be overcome by the use of new high-throughput sequencing strategies. In the present study, we demonstrate the potential of our unbiased pangenomic approach for identifying pathogenic viruses directly without prior genetic information. cDNAs, as templates for the Solexa platform, were prepared by random priming of RNA extracted from clinical samples. The system produced >5.908 million reads per run within 48 h and enabled us to obtain an almost complete spectrum of nucleotide information from each sample, including bacteria, fungi, protozoa and viruses. Upon further analysis of identified viral reads, influenza sequences accounted for 32–254 of the 158 327–580 424 valid reads in each patient sample (Table 1). Importantly, the coverage rates were in the range 8.39–44.56%, which is sufficient for viral subtype identification. Novel A (H1N1) and seasonal A (H3N2) viruses were detected in the absence of prior genetic information, consistent with the classical RT-PCR diagnostic method. This suggests that this approach can contribute to the detection of unexpected or mutated viruses by direct comparison with mutant and wild-type viral sequences in the RefSeq database. Likewise, pathogens eliciting similar symptoms could be distinguished in a single sample through analysis of unbiased high-throughput sequencing results, enabling the accurate diagnosis of mixed infection.
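The text does not state how the genome cover rate in Table 1 was calculated. A common definition, assumed here purely for illustration, is the percentage of reference genome positions covered by at least one mapped read. The sketch below implements that definition; the function name `coverage_rate` and the use of 0-based, half-open (start, end) alignment intervals are assumptions, not details taken from the study.

```python
from typing import Iterable, Tuple


def coverage_rate(genome_length: int, alignments: Iterable[Tuple[int, int]]) -> float:
    """Percentage of reference positions covered by at least one read.

    `alignments` holds 0-based, half-open (start, end) intervals, one per
    mapped read. Overlapping intervals are merged so that each reference
    position is counted only once.
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(alignments):
        # Clip each alignment to the reference boundaries.
        start, end = max(start, 0), min(end, genome_length)
        if start >= end:
            continue
        if cur_end is None or start > cur_end:
            # Close the previous merged block and open a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent read: extend the current block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return 100.0 * covered / genome_length


if __name__ == "__main__":
    # Toy example: three reads on a 100-nt reference, two of them overlapping.
    print(coverage_rate(100, [(0, 10), (5, 20), (50, 60)]))
```

Under this definition, a few hundred short reads spread across a genome of roughly 13 kb, as for influenza A, can plausibly give partial coverage of the order reported in Table 1, although the exact figures depend on read length and on how the reads are distributed.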

Transparency Declaration


The study was supported in part by grants from the ‘AIDS and viral hepatitis and other major infectious diseases prevention and control’ technology projects (2009ZX10004-102), the National Basic Research Program of China (2010CB534003 and 2005CB522905), as well as by an intramural grant from the Institute of Pathogen Biology (2008IPB010). The authors declare that there are no competing interests.

