The combination of manual and automated gene annotation of the first draft assembly of the A. pisum genome, Acyr_1.0, produced a consensus gene set of 34 604 genes (International Aphid Genomics Consortium, 2010). We compared the proteins encoded in this set with those encoded in seventeen fully sequenced genomes (see Table 1). This species set includes twelve other insects, representing all major insect groups with sequenced genomes (Paraneoptera, Hymenoptera, Coleoptera, Amphiesmenoptera, Nematocera and Brachycera); the crustacean Daphnia pulex; and three non-arthropod out-groups: the nematode Caenorhabditis elegans and the chordates Ciona intestinalis and Homo sapiens. Our sequence searches revealed that 12 885 genes in the Acyr_1.0 gene set (37%) show no significant similarity (e-value < 10⁻³) to genes in the other species included in the analysis. This large number of putative species-specific genes might be due in part to high false-positive rates in gene prediction programs. Genes in the A. pisum genome were predicted from two sources: the NCBI evidence-based RefSeq annotation pipeline, which uses expressed sequence tag data and protein homology to support a given gene structure, and several ab initio gene prediction programs whose outputs were merged into a single prediction with GLEAN (International Aphid Genomics Consortium, 2010). Of the 34 604 genes in the Acyr_1.0 gene set, 12 251 are based on RefSeq annotation; the proportion of gene models that rest on the more error-prone ab initio approaches is therefore high, which is compatible with a high rate of false positives. An abundance of transposable elements in A. pisum might be an additional reason for the high count of species-specific genes.
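The proportions quoted in this paragraph can be rechecked with a few lines of arithmetic. The sketch below uses only the counts given in the text. The variable names are illustrative, and treating every gene model without RefSeq support as an ab initio-derived model is an assumption read from the text, not a figure reported by the annotation consortium.

```python
# Counts taken from the text; all names below are illustrative.
TOTAL_GENES = 34_604    # Acyr_1.0 consensus gene set
NO_HIT_GENES = 12_885   # genes without a significant hit in the other species
REFSEQ_GENES = 12_251   # gene models based on RefSeq annotation


def percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


no_hit_pct = percent(NO_HIT_GENES, TOTAL_GENES)

# Assumption: every model not based on RefSeq comes from the
# ab initio (GLEAN-merged) predictions.
non_refseq_genes = TOTAL_GENES - REFSEQ_GENES
non_refseq_pct = percent(non_refseq_genes, TOTAL_GENES)

print(f"No significant hit: {no_hit_pct:.1f}% of the gene set")
print(f"Not RefSeq-based:   {non_refseq_genes} models ({non_refseq_pct:.1f}%)")
```

Under that assumption, roughly two thirds of the gene models lack RefSeq support, which is consistent with the argument that false positives may inflate the species-specific gene count.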
Although several insect genome projects include pipelines to detect transposable elements, these are rarely eliminated from the initial, automatically generated consensus gene sets, at least for the gene models that are predicted ab initio. Gene-finding programs usually mask repetitive regions but still predict the protein-coding parts of transposable elements, which can be eliminated in subsequent annotation phases. Because our phylome pipeline was run during the first annotation phase of the genome, we could not discard putative transposable elements from the analysis. This could be accounted for in future genome projects, at least for the easily detectable transposable element families, thus saving valuable time in the phylogenetic computations. Alternatively, as we will discuss below, the phylogenomic pipeline used here could also help to identify transposable elements, since they tend to involve many lineage-specific duplications. With these caveats in mind, the analysis of aphid-specific genes might identify true genetic specificities of aphids as compared to other insects. Shared gene sets, in contrast, may provide information on the genetic similarities between organisms. Our sequence comparison analyses showed that A. pisum shares between 30% and 53% of its gene repertoire with each of the other insects (Fig. 1). The two species sharing the highest percentage of aphid genes were the wasp Nasonia vitripennis and the beetle Tribolium castaneum (53% in both cases). Interestingly, the closest relative among insects with sequenced genomes, the body louse Pediculus humanus, shares only 38% of the pea aphid genes. This low percentage is probably related to an extreme reduction in the size of the genome of this human parasite (Johnston et al., 2007), since genome size, and not just evolutionary distance, is one of the strongest determinants of shared gene content between related species (Snel et al., 1999).
We subsequently applied a pipeline similar to the one used for the human phylome (Huerta-Cepas et al., 2007) to reconstruct the phylogeny of every aphid gene, obtaining a total of 23 523 phylogenetic trees and multiple sequence alignments (see Experimental procedures). First, significant hits (e-value < 10⁻³) that overlapped with more than 50% of the query aphid sequence were selected to reconstruct the phylogeny. Multiple sequence alignments of homologous proteins were obtained with MUSCLE v.3.6 (Edgar, 2004) and then trimmed with trimAl (Capella-Gutierrez et al., 2009) to filter out gap-rich columns. Phylogenetic analyses were performed using Neighbor Joining (NJ) and Maximum Likelihood (ML) approaches as implemented in PhyML (Guindon & Gascuel, 2003) (see Experimental procedures section for more details). The resulting alignments, phylogenies and orthology predictions can be accessed through the PhylomeDB (http://phylomedb.org) and AphidBase (Legeai et al., 2009) (http://www.aphidbase.com) databases. PhylomeDB is a public database for complete collections of gene phylogenies (phylomes) that allows users to explore the evolutionary history of genes through the visualization of phylogenetic trees and alignments, and to obtain their phylogeny-based orthology and paralogy relationships across a number of species. Since these trees and alignments are generated automatically, we recommend inspecting the protein alignments to judge the quality of the data. As explained in the methods section, these alignments can be refined and expanded for further analyses.
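The homolog selection step described above (e-value below 10⁻³ and overlap with more than 50% of the query) can be illustrated with a minimal sketch. This is not the code used in the phylome pipeline: the `Hit` record, its field names and the `select_homologs` function are hypothetical, and the overlap is assumed here to be a single contiguous aligned region measured on the query sequence.

```python
from dataclasses import dataclass

EVALUE_CUTOFF = 1e-3       # from the text: e-value < 10^-3
MIN_QUERY_COVERAGE = 0.5   # from the text: more than 50% of the query


@dataclass(frozen=True)
class Hit:
    """One search hit against a target proteome (hypothetical record)."""
    target_id: str
    evalue: float
    query_start: int  # 1-based, inclusive position on the query
    query_end: int    # 1-based, inclusive position on the query


def query_coverage(hit: Hit, query_length: int) -> float:
    """Fraction of the query covered by the aligned region of the hit."""
    aligned = hit.query_end - hit.query_start + 1
    return aligned / query_length


def select_homologs(hits: list[Hit], query_length: int) -> list[Hit]:
    """Keep hits that pass both the e-value and the coverage thresholds."""
    return [
        h for h in hits
        if h.evalue < EVALUE_CUTOFF
        and query_coverage(h, query_length) > MIN_QUERY_COVERAGE
    ]


# Toy example with made-up identifiers and a 200-residue query.
example_hits = [
    Hit("target_a", 1e-10, 1, 150),  # passes both filters
    Hit("target_b", 1e-10, 1, 100),  # exactly 50% coverage: rejected
    Hit("target_c", 1e-2, 1, 200),   # e-value too high: rejected
    Hit("target_d", 1e-4, 50, 199),  # passes both filters
]
selected = select_homologs(example_hits, query_length=200)
print([h.target_id for h in selected])
```

Both thresholds are applied as strict inequalities, following the wording of the text. A hit split into several aligned segments would need its segments merged before coverage is computed, which this sketch does not attempt.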