Barcoding of aphids (Hemiptera, Aphididae and Adelgidae): proper usage of the global data set



The basic principles of DNA barcoding presuppose the creation and operation of an extensive library based on reliably identified specimens, with the possibility of validating each identification. Therefore, information on the morphological identification of the individual samples used for DNA barcoding, for example the identification keys and descriptions used, must be clearly reported. In addition, the largest available data set of sequences must be used. Access to currently private data is of special interest, especially when the database regulations provide for it, because it encourages cooperation among researchers and saves both time and resources. The cryptic aphid species complexes Aphis oenotherae-holoenotherae and A. pomi-spiraecola are used to illustrate these points.

DNA barcoding has been proposed as an approach to the characterization of life forms. It is a promising method for exploring biodiversity and species identification based on a reference library of voucher specimens identified by taxonomic experts. The compilation of a DNA barcode library for aphid (Hemiptera, Aphididae and Adelgidae) species carried out in the framework of the Barcode of Life Data Systems (BoLD) is a special example of the applicability of barcoding to explore diversity in aphids (Foottit et al. 2008, 2009).

DNA barcoding is of particular importance for reassessing the status of intraspecific forms and detecting cryptic species (Hebert et al. 2003; Hajibabaei et al. 2007; Valentini et al. 2008). For this purpose, numerous samples representing most of the distribution area of the analysed species complex are needed. Therefore, the ability to update the DNA barcoding data set is critically important as new samples are acquired.

Lee et al. (2011) reported a substantial update of the aphid (Hemiptera: Aphididae) data set based on material from the Korean Peninsula. The authors also examined the effect of adding Korean samples on the previously reported intraspecific genetic divergence (Table 3 in Lee et al. 2011). One might expect such an updated database to be of great value when examining cryptic species complexes. Unfortunately, the authors seem to underuse the available data on certain groups of potentially cryptic species. One example is the species complex of the genus Aphis (Bursaphis) inhabiting evening primroses (Oenothera spp.). At present, two morphologically indistinguishable species are reported as inhabiting Oenothera. Aphis (Bursaphis) oenotherae Oestlund colonizes gooseberries and currants in North America, migrating for the summer to Onagraceae, including Oenothera. Aphis (Bursaphis) holoenotherae Rakauskas is holocyclic on Oenothera in Europe (Blackman & Eastop 2011; for more details see Rakauskas 2008). Divergences among partial CO-I sequences of all Aphis samples collected from Oenothera in the Palaearctic region, including Korea and Japan, ranged from 0.00% to 0.16% (Rakauskas et al. 2011). Therefore, the intraspecific divergences of the CO-I barcoding region of A. oenotherae based on additional samples (Lee et al. 2011) were expected to be of particular interest for this case. Yet they proved to be of little use, for two reasons.

First, no mention is made of the morphological identification keys used. It is therefore impossible to validate the morphological identification of the A. oenotherae samples used for DNA barcoding, because different keys will give different results (e.g. Rojanavongse & Robinson 1977; Remaudière 1993; Blackman & Eastop 2006; Rakauskas 2008).

Second, only two samples of A. oenotherae were used when calculating the intraspecific divergences of this species (Table 3 in Lee et al. 2011): one from Korea (GenBank Accession No. GQ904116) and the second from Hawaii (EU701472). In addition to these two, the BoLD public record for A. oenotherae contains ten partial CO-I sequences (DQ418813–DQ418817, DQ418835–DQ418839, including DQ418838 from Korea) that were publicly available before 9 March 2010, when the manuscript was initially submitted by Lee et al. (2011). Consequently, twelve specimens might have been included in the analysis instead of the two used by Lee et al. (2011). When compared with the standard barcoding fragment of the partial CO-I sequence (654 bp, represented by EU701472), the remaining ten sequences (DQ418813–DQ418817, DQ418835–DQ418839, 592 bp in length) overlap the barcoding region from position 416 onwards. The final alignment of these 12 sequences contained 239 positions, and three sequences, DQ418813, DQ418814 and GQ904116, showed nucleotide substitutions at sites matching positions 465, 475 and 654 of EU701472, respectively. Kimura two-parameter distances calculated for this data set with MEGA 5 (Tamura et al. 2011) showed pairwise intraspecific sequence divergences ranging from 0.00% to 0.84%. This contradicts the statement of Lee et al. (2011) (p. 35, also Table 3) that the sequences of A. oenotherae are identical. One can argue that such a small difference is meaningless. Nonetheless, even low divergences of the partial CO-I sequence (e.g. range 0.17–0.34% between Megoura species, see Table 2 of Lee et al. 2011) might indicate different species. In addition to the above-mentioned data set, the BoLD record for A. oenotherae indicates 22 more specimens with barcodes that are not yet included in the public record and were therefore not used in the distance analysis by Lee et al. (2011).
According to the BoLD record, many of the A. oenotherae samples are of North American origin and might be of particular interest for this case, because the type locality of A. oenotherae is Minnesota (Oestlund 1887). Samples from currants and gooseberries (the known winter hosts of A. oenotherae) appear to be of special interest when compared with those collected from Oenothera spp. BoLD offers the possibility to apply for permission to access private data (see BoLD Handbook). An analysis of all sequences, both public and private, would have provided much more persuasive information on the genetic diversity within the A. oenotherae species complex than that given by Lee et al. 2011 (2 sequences, Table 3 therein).
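The arithmetic behind the 239-position alignment and the upper divergence value can be sketched in a few lines of Python. This is a minimal illustration, not the MEGA 5 procedure: the sequences are synthetic placeholders rather than the GenBank records, the three substitutions are assumed to be transitions, and the names `k2p_distance` and `with_transition` are hypothetical helpers introduced only for this example.

```python
import math
from itertools import combinations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}


def k2p_distance(seq_a, seq_b):
    """Kimura two-parameter distance between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    if transitions == 0 and transversions == 0:
        return 0.0
    p = transitions / compared    # proportion of transitional differences
    q = transversions / compared  # proportion of transversional differences
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def with_transition(seq, index):
    """Return a copy of seq with a transition at the given 0-based index."""
    return seq[:index] + TRANSITION[seq[index]] + seq[index + 1:]


# The 592-bp sequences overlap the 654-bp barcode region from position 416.
barcode_start, barcode_end = 416, 654
overlap_sites = barcode_end - barcode_start + 1

# Synthetic stand-in alignment: nine identical sequences plus three that
# each carry one substitution, at barcode positions 465, 475 and 654.
reference = ("ACGT" * 60)[:overlap_sites]
variants = [with_transition(reference, pos - barcode_start)
            for pos in (465, 475, 654)]
sequences = [reference] * 9 + variants

distances = [k2p_distance(a, b) for a, b in combinations(sequences, 2)]
print(f"{overlap_sites} sites, range "
      f"{100 * min(distances):.2f}% to {100 * max(distances):.2f}%")
```

Under these assumptions the largest distance arises between two sequences that each carry a different single substitution, that is, two differing sites out of 239, which gives about 0.84%.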

The same holds for the potentially cryptic aphid species Aphis spiraecola. Lee et al. (2011) indicate that 51 sequences of A. spiraecola were involved in their intraspecific distance analysis (Table 3 of Lee et al. 2011), although only 15 samples of this species can be found in their Supplementary Tables 1 and 2. In addition to these fifteen (EU701492–EU701504, GQ904113 and DQ499028), the BoLD record for A. spiraecola contains 43 partial CO-I sequences (FJ998563–FJ998605) that were publicly available before the initial submission of Lee et al. (2011). Distance analysis (Kimura two-parameter distances) of the entire publicly available data set from the BoLD record for A. spiraecola showed that the average intraspecific divergence of the partial CO-I sequence was 0.15%, ranging from 0.00% to 3.13% (see also Fig. 1), while Lee et al. (2011) reported a mean intraspecific sequence divergence of 0.18% (0.00–3.43%). Sequence data of other species complexes also appear to be underused. Fourteen sequences of A. fabae were analysed by Lee et al. (2011) to evaluate intraspecific divergences, while 124 specimens with DNA barcodes are recorded in BoLD. The same concerns exist for Myzus persicae (14 sequences of 102 DNA barcodes), Aulacorthum solani (10 sequences of 34 DNA barcodes) and Aphis glycines (24 sequences of 45 DNA barcodes).
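The sequence counts quoted for A. spiraecola can be checked by expanding the accession ranges. The following Python sketch does only that bookkeeping; `expand_range` is a hypothetical helper written for this illustration, and it assumes that the accessions within a range are numbered consecutively with a shared letter prefix.

```python
import re

ACCESSION = re.compile(r"([A-Z]+)(\d+)")


def expand_range(first, last):
    """List every accession from first to last, inclusive."""
    m_first = ACCESSION.fullmatch(first)
    m_last = ACCESSION.fullmatch(last)
    if not m_first or not m_last:
        raise ValueError("unrecognised accession format")
    prefix, start = m_first.group(1), m_first.group(2)
    if m_last.group(1) != prefix:
        raise ValueError("range endpoints must share a prefix")
    width = len(start)  # keep any leading zeros in the numeric part
    end = int(m_last.group(2))
    return [f"{prefix}{n:0{width}d}" for n in range(int(start), end + 1)]


# Accessions listed in the supplementary tables of Lee et al. (2011)
listed_in_supplement = expand_range("EU701492", "EU701504") + [
    "GQ904113",
    "DQ499028",
]
# Further public accessions in the BoLD record for A. spiraecola
additional_public = expand_range("FJ998563", "FJ998605")

total_public = len(listed_in_supplement) + len(additional_public)
print(len(listed_in_supplement), len(additional_public), total_public)
```

The totals of 15, 43 and 58 agree with the numbers given in the text and with the 58 A. spiraecola sequences used for Figure 1.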

Figure 1.

A neighbour-joining tree (Kimura two-parameter distances) based on 58 partial cytochrome c oxidase subunit I sequences of Aphis spiraecola and 4 of Aphis pomi that were publicly available before the submission of Lee et al. (2011). Sequences used by Lee et al. (2011) are marked with asterisks. The tree was constructed using MEGA 5 (Tamura et al. 2011).

In conclusion, since the basic principles of DNA barcoding presuppose the creation and operation of an extensive library based on reliably identified specimens, with the possibility of validating each identification, certain procedures should be carefully followed. The information needed to assess the reliability of the morphological identification of the individual samples used for DNA barcoding must be clearly presented in the materials and methods of the respective publications. The largest available data set of sequences, including those of closely related cryptic species, must be used. Access to currently private data is of special interest, especially when the database regulations provide for it, because it encourages cooperation among researchers and saves both time and resources.


We wish to thank Dr. Andrew S. Jensen for the language correction and three anonymous reviewers for their helpful suggestions on the manuscript.

The ideas presented here come from both R.R. and J.B. R.R. drafted the text, J.B. generated the figure and also provided additional text and comments.