Ecogenomics: using massively parallel pyrosequencing to understand virus ecology


  • MJR is a professor of plant biology studying plant and fungal virus ecology and evolution; PS is a postdoctoral fellow interested in plant virus biodiversity; GBW is currently a postdoctoral fellow developing new methods in high throughput sequencing; JQ is currently a postdoctoral fellow developing bioinformatics tools; JDW and HL are bioinformaticists developing methods for sequence analysis; FC is an ecologist studying microbial biodiversity in Costa Rica; GS is currently a postdoctoral fellow studying plant biochemistry and genetics; BAR is a professor studying nucleic acid biochemistry and analysis methods.

Marilyn J. Roossinck, Fax: 580 224 6692; E-mail:


Environmental samples have been analysed for viruses in metagenomic studies, but these studies have not linked individual viruses to their hosts. We designed a strategy to isolate double-stranded RNA, a hallmark of RNA virus infection, from individual plants and convert this to cDNA with a unique four nucleotide Tag at each end. Using 96 different Tags allowed us to pool samples and still retain the link to the original sample. We then analysed the sequence of pooled samples using massively parallel sequencing with Roche 454 pyrosequencing such that 384 samples could be assessed per picotiter plate. Using this method we have been able to analyse thousands of plants, and we have discovered several thousand new plant viruses, all linked to their specific plant hosts. Here we describe the method in detail, including the results and analysis for eight pools of samples. This technology will be extremely useful in understanding the full scope of plant virus biodiversity.


Several recent metagenomic studies have analysed prokaryotic viruses in a variety of unexpected environments (Breitbart et al. 2002; Azam & Worden 2004; Short & Suttle 2005; Angly et al. 2006; Hugenholtz & Tyson 2008; Nakamura et al. 2009). In these studies the viruses from the environmental samples were collected by passage through a series of filters. After nucleic acid extraction, shotgun sequence analysis revealed that viruses are abundant in these environments and up to 80% of the sequences obtained did not have any evidence of homologs in GenBank by blast (Suttle 2005). While a great deal has been learned from these studies, none of the viral sequences obtained can be linked to a specific host, limiting further ecological studies. In addition, no studies to date have addressed eukaryotic viruses.

With the advent of new sequencing technologies, such as Roche 454 pyrosequencing, orders of magnitude more data can be generated in a relatively short time. This technology likely will reveal information on novel viral genes that may be useful to gain a deeper understanding of the diversity and distribution of viruses in various environments. With this in mind, we set out to conduct a biodiversity inventory of viruses that links each viral sample to its eukaryotic host in a terrestrial system. We chose plant viruses for this study for several reasons: (i) plants are not mobile, so the location of the host can be precisely determined, and in many cases resampling will be possible for further studies; (ii) plants are easier to work with and do not require the same levels of biosafety that are necessary for studies on animal viruses; (iii) the majority of known plant viruses have RNA genomes that generate double-stranded RNA (dsRNA) at some point in their life cycle. We used the dsRNA, a form of nucleic acids that is generally unique to viruses, to assess RNA virus infection in plants, by converting it to cDNA through a process specific for dsRNA. The resulting cDNA then was amplified with tagged primers that could cross reference each sample to the sequences obtained by pyrosequencing in pools of 24 to 96 uniquely tagged samples. As a result we are discovering thousands of plant viruses that are generally unique, and only distantly related to known viruses. Since our sequences are directly linked to the original plant host we coined the term ‘Ecogenomics’ to distinguish this study from the metagenomic studies from environmental samples.

Materials and methods

Study sites

We conducted these studies at two sites: the Tall Grass Prairie Preserve in northeastern Oklahoma, an area of relatively low plant diversity; and the Area de Conservación Guanacaste in northwestern Costa Rica, an area of extremely high plant diversity.


Ten to twenty grams of plant material were collected in the field. For each sample the plant was photographed in its surroundings and close-up. A GPS reading was taken where possible. In some cases where the canopy was very dense, GPS readings were taken a few meters away and the approximate distance and direction to the plant were noted. Samples were transported to the laboratory and stored at 4 °C until extraction within a few days.

Enrichment of dsRNA

The dsRNA enrichment procedure is a modification of a previously published method (Dodds et al. 1984), adapted to use a spin column as described by Márquez et al. (2007). Five grams of plant tissue was flash frozen in liquid nitrogen and pulverized with a mortar and pestle. The resulting powder was transferred immediately to a 50 mL plastic centrifuge tube containing 10 mL of extraction buffer (0.1 M NaCl; 50 mM Tris pH 8; 1 mM EDTA; 1% SDS; and, optionally, 0.1% 2-mercaptoethanol and 1% dry powder polyvinylpolypyrrolidone) and 10 mL of TE-saturated phenol:chloroform (v:v, 1:1), and shaken vigorously for 10 min. The resulting slurry was centrifuged in a low speed (∼3000 rpm) table-top centrifuge to break the emulsion, and the aqueous phase was removed for a second phenol:chloroform extraction. The volume of the final aqueous phase was carefully measured, and absolute ethanol was added to a concentration of 16.5%. The resulting mixture was added to a small plastic centrifuge tube (BioRad Econocolumn, catalog # 731-1550) containing about 100 mg of CF-11 Cellulose (Whatman), and mixed thoroughly. The column then was placed into a 12 mL plastic tube (Falcon), and the tube and column were placed into a table-top centrifuge and spun at a few hundred rpm for 30 s (the column should be centrifuged at the minimum time and speed required to move the liquid from the column to the tube). The tube was emptied and the column filled with 10 mL of application buffer (0.1 M NaCl; 50 mM Tris pH 8; 0.5 mM EDTA; 16.5% ethanol). The cellulose was resuspended thoroughly and the centrifugation process repeated. After this first wash cycle the samples were washed two to five more times, depending on their viscosity and the ease of liquid removal.

After the final wash, the dsRNA was eluted by mixing the cellulose with 4.5 mL of elution buffer (0.1 M NaCl; 50 mM Tris pH 8; 0.5 mM EDTA), and repeating the centrifugation over a clean 15 mL Corex-type tube. Sodium acetate (0.5 mL of a 3 M solution) was added to the tube, along with 10 mL of ice cold ethanol, and the dsRNA was precipitated at −20 °C overnight. The tube was centrifuged at 10 000 g in a swinging bucket rotor. The pellet was dissolved in 500 μL of NAE buffer (0.3 M sodium acetate, 0.1 mM EDTA, pH 8) and transferred to a microcentrifuge tube. One mL of ice cold 95% ethanol was added and the sample was precipitated at −20 °C overnight. The final dsRNA pellet was dissolved in a volume of 100 μL of 0.1 mM EDTA pH 8 and stored at −20 °C.
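The ethanol adjustment in the enrichment procedure depends on the measured volume of the aqueous phase. The following Python sketch shows the arithmetic; the function name is hypothetical, and it assumes the 16.5% figure is volume per volume and that volumes are additive, neither of which is stated explicitly in the protocol.

```python
def ethanol_volume(aqueous_ml, final_fraction=0.165):
    """Volume of absolute ethanol (mL) to add to reach final_fraction (v/v).

    Assumption: volumes are additive, which is only approximately true
    for ethanol-water mixtures.
    """
    if not 0 <= final_fraction < 1:
        raise ValueError("final_fraction must be in [0, 1)")
    # Solve e / (aqueous + e) = final_fraction for e.
    return aqueous_ml * final_fraction / (1.0 - final_fraction)


if __name__ == "__main__":
    # For 10 mL of aqueous phase, roughly 2 mL of ethanol is needed.
    print(round(ethanol_volume(10.0), 2))
```

Under these assumptions, a 10 mL aqueous phase needs just under 2 mL of absolute ethanol.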

Reverse transcriptase reaction

The dsRNA (1 μL) was mixed with 7 μL H2O, 1 μL 10 mM TE (10 mM Tris pH 7.5, 10 mM EDTA) and 2 μL primer 5′ CCTTCGGATCCTCC N6-12 3′ at a concentration of 20 μM, in a screw-capped tube. The samples were placed in boiling water for 2 min, and subsequently quenched on ice. Eight μL of a mix containing 1 μL (5 units) of Superscript III (Invitrogen), 4 μL buffer and 2 μL dithiothreitol (supplied by manufacturer) and 1 μL dNTPs (10 mM each) was added to each sample, and the samples were incubated on ice for 10 to 15 min before being placed at 50 °C for 1 h.

Removal of primers

After removal from 50 °C, 1 μL (1 μg) of ribonuclease A (Sigma, prepared at 10 mg/mL in water and boiled for 10 min) was added to each sample and the samples were incubated at room temperature for 15 min. The samples then were heated to 85 °C for 2 min. Immediately, 100 μL of ‘PB’ buffer (Qiagen) was added and the samples were transferred to a Qiagen PCR purification column. Samples were purified according to the manufacturer’s instructions, with an additional washing step with 750 μL of 35% guanidine HCl. The samples were eluted in 30 μL of a 1:10 dilution of EB.


Samples were amplified individually, in sets of 24 or 96, in an Idaho Technologies Rapid Cycler II, using 1.5 μL of the RT product in a 15 μL reaction. The reactions also contained a final concentration of 1 × buffer M (medium Mg++ buffer, Idaho Technologies), 170 μM dNTPs, 1 μM of a unique 454 Tag primer (Table 1), and 1 unit of Taq polymerase. The amplification program was: 94 °C for 1 min; 65 °C 0 s; 72 °C 45 s, with a slope of 9, followed by 40 cycles of 94 °C 0 s; 45 °C 0 s; 72 °C 30 s, with a slope of 5, and a final 5 min at 72 °C and 5 min at 37 °C. Following amplification, samples were removed from capillary tubes and 5 μL of each sample from a set was mixed to form a pool for sequence analysis.

Table 1.   Four nucleotide tags used to identify individual samples
Tag no. | Tag seq.* | Tag no. | Tag seq. | Tag no. | Tag seq. | Tag no. | Tag seq.
  1. *The four nucleotide Tag sequence was followed by the sequence CCTTCGGATCCTCC. Tags 1–24 were used for the 24 Tag set, and 1–96 for the 96 Tag set.


Sequence analysis

Sequencing was performed on a 454/Roche GS-FLX (Margulies et al. 2005) with sample handling essentially as described by the manufacturer, with several modifications that improved sequence reproducibility while reducing labour intensive sample manipulation steps prior to loading (Wiley et al. 2009). The initial modification was the replacement of Qiagen spin column purification after each enzymatic step with Agencourt Ampure Solid Phase Reversible Immobilization (SPRI) beads. The use of SPRI beads over silica mini-columns has two significant advantages. First, the overall yield is higher with SPRI beads (90–95%) than with mini-columns (80–85%). Second, by varying the volume of SPRI bead suspension mixed with the DNA solution, it is possible to selectively purify fragments over 300 bp in size. As shorter fragments preferentially amplify during emPCR, this significantly improves average read length and reduces the number of mixed reads in the final pyrosequencing step.

The second modification was the removal of the steps for generating a single stranded library molecule while enriching for molecules containing only A and B adapters ligated to either end. Because molecules with A on both ends will not amplify properly in emPCR, and molecules with B on either end will not enrich post-emPCR, this step was deemed unnecessary.

These modifications facilitated the subsequent automation of the library preparation process on a Caliper SciClone ALH with a Twister II plate positioner programmed to add, move, and remove buffers, enzymatic mixtures, and SPRI bead suspensions as well as move the reaction plate to various stations within the robot. This automation allowed for a walk-away process in which no human manipulations are required except for the preparation of the robot and enzymatic mixtures.

Data deconvolution and blast searches

Pools of tagged cDNA sequenced on the 454 were deconvoluted and assembled using a software pipeline consisting of the Perl scripts get_454_pools and split_454_pools. Briefly, get_454_pools collates the data for all runs of the same sample pool together and calls the program split_454_pools that bins each read from the sequenced, pooled sample according to the Tag at the beginning of the read and trims the Tag and primer sequence from the read.
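The binning and trimming step can be illustrated with a minimal Python sketch. This is not the get_454_pools or split_454_pools Perl code; the function name, the tag-to-sample dictionary, the example tags and the exact-match rule are all assumptions made for illustration. It assumes each read begins with a 4 nt Tag followed by the shared primer sequence CCTTCGGATCCTCC (Table 1).

```python
PRIMER = "CCTTCGGATCCTCC"  # shared sequence following each Tag (Table 1 footnote)
TAG_LEN = 4


def split_pool(reads, tag_to_sample):
    """Bin reads by their leading Tag and trim the Tag and primer.

    reads: iterable of (read_id, sequence) pairs.
    tag_to_sample: dict mapping a 4 nt Tag to a sample name (hypothetical).
    Returns (bins, unassigned); bins maps sample name to a list of
    (read_id, trimmed_sequence) pairs.
    """
    bins = {sample: [] for sample in tag_to_sample.values()}
    unassigned = []
    for read_id, seq in reads:
        seq = seq.upper()
        tag, rest = seq[:TAG_LEN], seq[TAG_LEN:]
        # Exact matching only; a production pipeline may tolerate read errors.
        if tag in tag_to_sample and rest.startswith(PRIMER):
            bins[tag_to_sample[tag]].append((read_id, rest[len(PRIMER):]))
        else:
            unassigned.append((read_id, seq))
    return bins, unassigned


if __name__ == "__main__":
    tags = {"ACGT": "plant_01", "TGCA": "plant_02"}  # made-up tags and names
    demo = [
        ("r1", "ACGT" + PRIMER + "GGATTACA"),
        ("r2", "TGCA" + PRIMER + "CCCTTTAA"),
        ("r3", "GGGG" + PRIMER + "AAAA"),  # Tag not in the set
    ]
    binned, leftover = split_pool(demo, tags)
    print(binned)
    print(len(leftover))
```

Reads whose Tag or primer cannot be matched are kept in a separate list rather than discarded, so the proportion of unassignable reads in a pool can be inspected.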

The split_454_pools script directs three assemblies of each sample bin using the 454 GS De Novo Newbler Assembler. Data for the second and third assemblies is produced by calling sfffile automatically to produce sff files with shorter reads than the standard 100 FLX sequencing cycles, at 84 and 63 cycles. The assembled contigs from the triple Newbler assemblies are then assembled with reduced quality values using Phrap.

The split_454_pools script then automatically runs both BlastX and BlastN on the generated contigs against the GenBank non-redundant protein (nr) and nucleotide (nt) databases, respectively. Those contigs that showed no similarity using either BlastX or BlastN were reprocessed using tBlastX against the GenBank nucleotide database (nt). Those contigs that still showed no similarities using tBlastX were searched using BlastX and BlastN against GenBank environmental databases (env_nr and env_nt, respectively). All Blast searches were performed with an Expect (E) value of 0.001. Any remaining contigs with no evidence of homologs underwent conserved domain search using RPS-Blast (Marchler-Bauer et al. 2007).
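The order of searches amounts to a cascade in which a contig moves to the next tier only if the current tier returns no hits. The Python sketch below encodes only that control flow. The search function is a placeholder supplied by the caller, not a real BLAST binding, and all names, the database label for the conserved domain search, and the use of the same E value for RPS-Blast are assumptions. BlastX is paired with protein databases and BlastN with nucleotide databases.

```python
E_VALUE_CUTOFF = 0.001  # Expect value stated for the Blast searches

# Tiers in the order described in the text; each tier lists (program, database).
SEARCH_TIERS = [
    [("blastx", "nr"), ("blastn", "nt")],
    [("tblastx", "nt")],
    [("blastx", "env_nr"), ("blastn", "env_nt")],
    [("rpsblast", "cdd")],  # conserved domain search; database label assumed
]


def classify_contig(contig, run_search):
    """Return (tier_index, hits) for the first tier with any hit.

    run_search(program, database, contig, evalue) is a caller-supplied,
    hypothetical function returning a list of hits. If no tier yields a
    hit, (None, []) is returned.
    """
    for tier_index, tier in enumerate(SEARCH_TIERS):
        hits = []
        for program, database in tier:
            hits.extend(run_search(program, database, contig, E_VALUE_CUTOFF))
        if hits:
            return tier_index, hits
    return None, []


if __name__ == "__main__":
    def fake_search(program, database, contig, evalue):
        # Stand-in that only "finds" something with tBlastX.
        return ["example_hit"] if program == "tblastx" else []

    print(classify_contig("ACGTACGT", fake_search))
```

Keeping the tier list as data makes it easy to see, and to change, which searches are attempted before a contig is declared to have no evidence of homologs.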


Isolation of dsRNA

Although most RNA viruses of plants have single stranded genomes, they must generate both plus and minus strands of their genome during replication. It is not clear if this exists as dsRNA in the intact cell, but dsRNA is readily extracted from plant tissue, and can be used as a hallmark of RNA virus infection (Dodds et al. 1984). The dsRNA is much more stable than ssRNA, and hence the extraction procedure does not require the level of rigour needed for working with ssRNA. We have very successfully extracted dsRNA from thousands of plant samples in a minimally equipped field lab in Costa Rica. Some plant tissues are more difficult to extract than others due to the presence of additional compounds such as phenolics, latex, or other secondary metabolites. Occasionally the extraction procedure must be modified by adding larger volumes of extraction buffer and phenol:chloroform. The addition of PVPP can aid in removing phenolic compounds, and 2-mercaptoethanol can reduce problems of oxidation. For some tissue with very high latex content, such as banana, we have used a leaf roller and dripped the pressed extract from fresh tissue into a tube containing the extraction buffer and phenol:chloroform. However, for the vast majority of samples that we have processed the standard procedure works very well.

Converting dsRNA to cDNA

To convert dsRNA to cDNA, the RNA hybrid was melted by boiling, and primers with a specific sequence at the 5′ end followed by a random sequence at the 3′ end were annealed to initiate cDNA synthesis by reverse transcriptase. This allowed for priming at numerous sites along either strand of the RNA. The optimal cDNA reaction temperature for dsRNA was 50 °C; this favours the single-stranded RNA form. We also found that the products of the cDNA synthesis were relatively short, on average less than 500 nt, when a 6 nt random sequence was used. We analysed primers containing from six to twelve random nucleotides, and found that similar results were obtained for ten or twelve random nucleotides, yielding a product that ran as a large smear between about 300 nt and more than 3 kb. We selected the primer containing twelve random nucleotides for further study.

Removal of primers and amplification of cDNA

After the RT reaction the cDNA products are single-stranded, of varying length, and contain regions of complementarity. They each contain a specific sequence at their 5′ ends that was part of the primer (Fig. 1), whereas the sequence at the 3′ end is unique. To regenerate a primer site at both ends of a dsDNA for amplification, complementary strands of cDNA must anneal and fill in. This results in selective amplification of cDNAs that are complementary, or originated from a dsRNA molecule. However, unless extreme care is taken to remove all of the primers after the RT reaction, any remaining primers with random sequences at the 3′ end could regenerate the primer sites on any DNA present in the sample during the amplification process. Thus, unreacted primers were removed by treatment with RNase A to remove RNA followed by heating to 85 °C to melt off any primers that may still be annealed. The chaotropic reagent (Qiagen kit) was added to the samples while still hot, and the samples were purified on a Qiagen PCR purification column. After the first step, where the DNA binds to the column, a wash step with 35% guanidine HCl was added, as suggested by the manufacturer. Purified RT reaction products were eluted from the Qiagen column in 0.1 × EB, and used for polymerase chain reaction (PCR) amplification. The first cycle used a high annealing temperature and a longer extension time, to allow the complementary strands of cDNA to anneal and fill in. This was followed by 40 cycles of amplification as described in the methods.

Figure 1.

 Schematic outline of the conversion of dsRNA to sequencing-ready DNA. The dsRNA is converted to cDNA by random priming. Antiparallel strands of cDNA anneal to reconstitute the priming sites at the end after the first DNA synthesis reaction. Separate tagged primers are used for each sample.

Tagged primers for PCR

To link each sequence to the original plant sample, sets of 24 or 96 samples were amplified, with each sample having a unique primer containing a 4 nt Tag sequence (Table 1). The samples from a set were then pooled, and sequenced on a Roche 454 GS FLX sequencing machine. Initially we ran 4 pools of 24 samples in four of eight lanes of a single plate. However, the depth of sequencing from these reactions suggested that more samples could be analysed (Table 2). We synthesized an additional 72 primers with unique Tags and pooled 96 samples, which were analysed on one quarter of a picotiter plate, yielding 384 samples per picotiter plate. In our first set of primers we included one primer with the Tag sequence AAGG. This was the only Tag containing any dinucleotides, and it was not found with significant frequency in the pool. We replaced this primer with one without dinucleotides in the Tag, and all subsequent primers were synthesized without dinucleotides in the Tag. The amplification yielded smears of products, which were used directly for sequence analysis.
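As a rough check on how many Tags such a rule allows, the Python sketch below enumerates all 4 nt sequences in which no nucleotide is immediately repeated. Reading ‘dinucleotides’ as adjacent identical nucleotides (such as AA and GG in AAGG) is an assumption of this sketch, and the Tags actually listed in Table 1 may have been chosen using further criteria. The function names are hypothetical.

```python
from itertools import product


def has_adjacent_repeat(tag):
    """True if any nucleotide is immediately followed by the same nucleotide."""
    return any(a == b for a, b in zip(tag, tag[1:]))


def candidate_tags(length=4, alphabet="ACGT"):
    """All tags of the given length with no adjacent identical nucleotides."""
    all_tags = ("".join(p) for p in product(alphabet, repeat=length))
    return [tag for tag in all_tags if not has_adjacent_repeat(tag)]


if __name__ == "__main__":
    tags = candidate_tags()
    print(len(tags))        # 4 * 3 * 3 * 3 = 108 candidates
    print("AAGG" in tags)   # False: AAGG contains AA and GG
```

Under this reading, 108 of the 256 possible 4 nt sequences qualify, which is enough to supply a set of 96 Tags.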

Table 2.   Analysis of pyrosequencing results
Pool no. | No. tags recovered* | No. nts/pool | Reads/pool† | Avg. read length | Contigs/pool‡ | Avg. contig length‡ | Contigs/pool§ | Avg. contig length§ | No. contigs with virus hit BlastN¶ | No. contigs with virus hit BlastX** | No. contigs with no hits††
  1. *The number of Tags recovered when 24 Tags (first four pools) or 96 Tags (next four pools) were used.

  2. †The total number of reads recovered for each pool.

  3. ‡Contigs formed using the Newbler contig assembly.

  4. §Contigs formed using the triple-Newbler assembly.

  5. ¶Number of triple-Newbler assembly contigs in the pool with hits to viral sequences in GenBank by BlastN.

  6. **Number of triple-Newbler assembly contigs in the pool with hits to viral sequences in GenBank by BlastX.

  7. ††The number of triple-Newbler assembly contigs in the pool with no similarity to sequences in GenBank.

12413 561 70056 892238.41100415.6519649.1123934
22318 525 73578 955234.61237380.43694523.362695
32318 064 63875 874238.11356387.32718569.3264246
42316 510 62671 677230.31182475.69611726.3123739
69415 275 93661 574248.12677434.41808558.6285483
119517 974 48671 733250.62682402.11529566.1225753
159629 222 630101 695232.23744412.522180583.41265174
169527 797 677125 032213.13177410.191994568.3235553

Sequence analysis

Since the amplified products were generally under 1 kb, it was not necessary to shear the samples prior to sequencing. This was also necessary to maintain the Tag on the ends of the PCR products. Using the four pools of 24 Tags we obtained results for 93 uniquely tagged samples (one primer only worked in one of the four pools). This yielded 58 193 206 nucleotides in 268 124 reads. The average read length was 233.4 nts. We then switched to using pools with 96 Tags. In the 33 pools sequenced so far using the 96 Tags we have obtained an average of about 22 × 10^6 nucleotides per pool, 94 600 reads per pool, and average reads of about 240 nt. Detailed results from a subset of these pools are shown in Table 2. Some variation was found in the depth of sequencing per pool and per sample. This is to be expected from field samples, where the starting material in each sample will vary considerably with respect to quality and quantity of RNA. Raw data from individual pools from Table 2 were submitted to the MG_RAST database with ID numbers: 1 4445798.3; 2 4445799.3; 3 4445811.3; 4 4445803.3; 6 4445804.3; 11 4445815.3; 15 4445806.3; 16 4445807.3.

Triple contig generation

Using the standard Newbler assembly software we obtained about 1200 contigs per pool of 24 Tags, and 3000 contigs for pools of 96 Tags. The contigs were about 400 nt in length. Using the newer triple-Newbler assembly plus Phrap assembly, we obtained fewer contigs per pool, with significantly increased average lengths (Table 2). Non-viral sequences predominantly were similar to plant sequences, although some had similarity to bacterial or fungal sequences. In many pools the number of contigs with no similarity to anything in GenBank was greater than the number with viral similarity. Since the sequences of a vast majority of plant genes have been determined, and novel genes without homologs are rare, it seems likely that many of the contigs without similarity were from unknown viruses.

Blast searching

All contigs were searched against the non-redundant (nr) database of GenBank, using BlastN and BlastX. A majority of the reads were most similar to plant sequences, followed by bacterial sequences, fungal sequences and viral sequences (Table 3). Only 30% of the samples had similarity to viruses with BlastN whereas, as expected, the number of samples with similarity increased with BlastX. Samples without similarity by BlastN or BlastX were assessed by tBlastX, which marginally increased the number of sequences with viral similarity. In all, about 70% of the samples had sequences with the highest similarity to viral sequences (Table 4). However, in most cases similarity was only high enough to estimate that the virus in the sample belonged to the same family as the virus sequence in GenBank (Table 5). We found very few viruses with similarities high enough to be considered a strain of a known species. The distribution of plant virus families shown in Table 5 represents the results from the eight pools described, and probably is not representative of the entire study.
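The read counts in Table 3 can be converted to proportions to compare pools of different sequencing depth. The Python sketch below does this; the counts are transcribed from Table 3 and the reads per pool from Table 2, and the variable and function names are hypothetical. Each row of Table 3 sums to the reads per pool reported in Table 2, which the code checks before computing percentages.

```python
CATEGORIES = ("plant", "fungal", "bacterial", "viral", "unknown")

# Reads per category for each pool, transcribed from Table 3.
TABLE3 = {
    1: (33381, 5496, 12423, 988, 4604),
    2: (35090, 2351, 22807, 3995, 14712),
    3: (35243, 5296, 19822, 1750, 13763),
    4: (34785, 8179, 18835, 1974, 7904),
    6: (28427, 7688, 4780, 2796, 17883),
    11: (31570, 9730, 11636, 2192, 16605),
    15: (55388, 8058, 6333, 993, 30923),
    16: (72342, 19119, 16272, 1464, 15835),
}

# Reads per pool, transcribed from Table 2.
READS_PER_POOL = {1: 56892, 2: 78955, 3: 75874, 4: 71677,
                  6: 61574, 11: 71733, 15: 101695, 16: 125032}


def fractions(counts):
    """Map each read category to its share of the pool's reads."""
    total = sum(counts)
    return {name: n / total for name, n in zip(CATEGORIES, counts)}


if __name__ == "__main__":
    for pool, counts in TABLE3.items():
        # Consistency check between the two tables.
        assert sum(counts) == READS_PER_POOL[pool]
        percent = {k: round(100 * v, 1) for k, v in fractions(counts).items()}
        print(pool, percent)
```

Expressed this way, viral reads are a small share of each pool, and reads with no similarity to GenBank sequences outnumber viral reads in every pool listed.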

Table 3.   Distribution of pyrosequencing reads
Pool no. | No. plant reads* | No. fungal reads† | No. bacterial reads‡ | No. viral reads§ | No. unknown reads¶
1 | 33 381 | 5496 | 12 423 | 988 | 4604
2 | 35 090 | 2351 | 22 807 | 3995 | 14 712
3 | 35 243 | 5296 | 19 822 | 1750 | 13 763
4 | 34 785 | 8179 | 18 835 | 1974 | 7904
6 | 28 427 | 7688 | 4780 | 2796 | 17 883
11 | 31 570 | 9730 | 11 636 | 2192 | 16 605
15 | 55 388 | 8058 | 6333 | 993 | 30 923
16 | 72 342 | 19 119 | 16 272 | 1464 | 15 835

  1. *The number of reads with highest similarity to a plant sequence.

  2. †The number of reads with highest similarity to a fungal sequence.

  3. ‡The number of reads with highest similarity to a bacterial sequence.

  4. §The number of reads with highest similarity to a viral sequence.

  5. ¶The number of reads with no significant similarity to anything in GenBank.

Table 4.   Incidence of virus infection in plant families
Plant families | No. samples* | No. samples infected† | No. contigs with viral hits‡ | Avg. contig length§ | Total reads¶
Caesalpinaceae (Fab.)** | 40 | 29 | 31 | 611.2 | 1152
Mimosaceae (Fab.)** | 60 | 41 | 47 | 543.6 | 2278
Papilionaceae (Fab.)** | 63 | 41 | 50 | 1,884.0 | 2956

  1. *The number of samples from each plant family that were included in the 8 pools described here.

  2. †The number of these samples that had contigs with significant similarity to known viruses.

  3. ‡The number of contigs with highest similarity to viral sequences.

  4. §Average length of contigs with viral hits.

  5. ¶Total number of reads from viral contigs.

  6. **Fabaceae is divided into its three subfamilies, Caesalpinaceae, Mimosaceae and Papilionaceae.

Table 5.   Distribution of virus families
Virus families* | No. samples infected† | No. of contigs‡ | Avg. contig length§ | Total reads¶
  1. *The family of the viral sequence with the highest level of similarity to a contig.

  2. †The number of samples in these eight sample pools with hits to viral sequences of a particular family.

  3. ‡The number of contigs formed in the eight pools with highest similarity for each virus family.

  4. §Average length of the contigs with highest similarity to the virus family.

  5. ¶Total number of reads with top hits to a particular virus family.



Metagenomic studies have been very valuable in directing the rethinking of the global ‘virome’, i.e. there are orders of magnitude more viruses in nature than previously anticipated, but they have not been able to link any viruses found in environmental samples to their hosts. Ecogenomics, as described here, can fill this gap in our understanding. In addition, almost all metagenomics studies of viruses have characterized bacterial viruses, while the methods described here give us a way to analyse eukaryotic hosts and their viruses. However, the sample processing for this type of study is much more labour intensive than what is used in metagenomics, and hence ecogenomics can simply give a different perspective on the global virome. We also have collected a large amount of sequence data for the hosts, which has yet to be mined. In spite of the relatively high levels of ‘background’ sequence, however, we were able to obtain very large numbers of reads for viruses, and even more reads with no similarity to any sequences in GenBank that are likely to be components of novel viruses.

A few other related methods have been published for pooling samples for massively parallel sequencing (Binladen et al. 2007; Hamady et al. 2008). Here, however, we combine the entire process, from collecting samples in the field to the initial bioinformatics analyses of the resulting sequences. These methods should have broad application in related areas of study.


We thank Adelina Morales, Rosa Maribel Morales Martinez, and Adrian Guadamuz for technical assistance. This work was supported by: the National Science Foundation grant numbers EF-0627108 and EPS-0447262; the United States Department of Agriculture grant number OKLR-2007-01012; the Samuel Roberts Noble Foundation; and the Area Conservación Guanacaste.

Conflicts of interest

The authors have no conflict of interest to declare and note that the funders of this research had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.