Claude Rispe, INRA, UMR1099 BiO3P, Domaine de la Motte, F-35653 Le Rheu, France. Tel.: + 332 23 48 51 53; fax: + 332 23 48 51 50; e-mail: email@example.com
To study gene repertoires and their evolution within aphids, we compared the complete genome sequence of Acyrthosiphon pisum (reference gene set) and expressed sequence tag (EST) data from three other species: Myzus persicae, Aphis gossypii and Toxoptera citricida. We assembled ESTs, predicted coding sequences, and identified potential pairs of orthologues (reciprocal best hits) with A. pisum. Pairwise comparisons show that a fraction of the genes evolve fast (high ratio of non-synonymous to synonymous rates), including many genes shared by aphids but with no hit in Uniprot. A detailed phylogenetic study for four fast-evolving genes (C002, JHAMT, Apo and GH) shows that rate accelerations are often associated with duplication events. We also compare compositional patterns between the two tribes of aphids, Aphidini and Macrosiphini.
Natural selection (Darwin, 1859) guides the adaptation of organisms to their environment. Historically, adaptation studies have focused mostly on analyses of phenotypic variation. The recent enormous increase in sequence data now allows traces of selection to be detected directly at the gene level, without a priori assumptions about the gene(s) involved. This approach can be particularly efficient when complete genomes of related species are analysed, since patterns in sequence divergence can be surveyed and can help detect genes that show atypical evolutionary patterns. In particular, evolutionary rates can show if a gene evolves in an accelerated fashion, e.g. if changes of sequence are promoted by selection (positive selection). While most such comparisons have been made using whole genomes, such data are not always available and, in the present study, we use partially reconstructed sets of coding sequences from expressed sequence tag (EST) projects in different aphids. A broad comparative survey of the gene repertoires of related organisms, and of the evolutionary rates of genes, may indeed bring insights into the genes and functions that are particularly significant at the biological level for that group of organisms.
Aphids (Insecta: Hemiptera) are small insects that feed on plant sap. Most extant families appeared in the Cretaceous period (80–150 Ma) (Von Dohlen & Moran, 2000). Phenotypic plasticity (of reproductive mode and of dispersal) permits them to rapidly adapt to environmental changes (Moran, 1992). Some species are adapted to crops and are pests. They can affect plant growth directly by feeding but also as vectors of plant viruses (Dixon, 1998). Their effect on crops is enhanced by host-plant specialization (Hufbauer & Via, 1999; Hawthorne & Via, 2001) and by their rapid demographic increases as a result of viviparous clonal reproduction (parthenogenesis). Recently, the genome of the pea aphid, Acyrthosiphon pisum (Aphidinae, Macrosiphini), has been completely sequenced by the International Aphid Genomics Consortium (2010); it comprises close to 34 000 predicted genes.
With regard to other aphid species, different collections of ESTs have been obtained in Myzus persicae (tribe: Macrosiphini) (Figueroa et al., 2007; Ramsey et al., 2007) and also in two species of the Aphidini tribe, Toxoptera citricida and Aphis gossypii (Hunter et al., 2003). Most of these ESTs have already been described in order to identify organ-specific expression patterns and, in particular, the genes involved in feeding. While genomic comparisons between close species have been made in other insect groups, like Drosophila (Zdobnov et al., 2002; Heger & Ponting, 2007), this has not yet been undertaken for aphids. The genome of A. pisum has been compared with that of evolutionarily distant insects (orders: Diptera, Coleoptera, Hymenoptera, Lepidoptera, Phthiraptera) (International Aphid Genomics Consortium, 2010) and aphid ESTs have been compared with D. melanogaster (Sabater-Munoz et al., 2006; Gauthier et al., 2007), but aphid species have never been compared with each other on a large scale.
Here we compare the coding genomes of different aphid species comprising two relatively distant tribes (Macrosiphini and Aphidini) and characterized by different life-cycle and host-plant preferences: A. pisum, M. persicae, A. gossypii and T. citricida. For these comparisons, we used the pea aphid genome and all ESTs available for the three other species. We reconstruct sets of coding sequences (CDSs) and compute sets of putative orthologues between aphid species. We quantify the fraction of genes shared by different aphid species but unknown from other insects, which thus could play a special role in the biology of aphids. We also analyse the patterns of divergences among putative orthologues, and focus especially on fast-evolving sequences, which could be so as a result of positive selection and rapid adaptation to environmental changes, or strong co-evolutionary interactions such as those between insects and their host-plant.
Unique transcripts catalogues based on expressed sequence tags
Expressed sequence tags were retrieved from GenBank: 27 686 for M. persicae, 8344 for A. gossypii and 4304 for T. citricida. Tentative unique transcripts were obtained for each species. The redundancy (defined as one minus the ratio of the number of unique transcripts, i.e. singletons plus contigs, to the total number of ESTs) ranged between 0.457 for T. citricida and 0.657 for M. persicae. Despite the redundancy of ESTs, 48–80% of unique transcripts in these sets were composed of only one EST (singletons). A large fraction of unique transcripts did not have a hit in the Uniprot database (43% to 63%); either these sequences corresponded to genes unique to aphid species, or they represented non-coding regions (untranslated transcribed regions or UTRs) (Whitfield et al., 2002; Sabater-Munoz et al., 2006).
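The redundancy measure can be illustrated with a minimal sketch. The function names and the singleton/contig counts used in the example are hypothetical; only the EST total and redundancy value for M. persicae come from the text, and the reading of the definition (unique transcripts = singletons + contigs) is an assumption.

```python
def est_redundancy(n_singletons: int, n_contigs: int, n_ests: int) -> float:
    """Redundancy = 1 - (unique transcripts / total ESTs).

    Assumes unique transcripts = singletons + contigs.
    """
    if n_ests <= 0:
        raise ValueError("n_ests must be positive")
    n_unique = n_singletons + n_contigs
    if n_unique > n_ests:
        raise ValueError("unique transcripts cannot exceed the number of ESTs")
    return 1.0 - n_unique / n_ests


def implied_unique_transcripts(n_ests: int, redundancy: float) -> int:
    """Invert the definition to recover the approximate number of unique transcripts."""
    return round(n_ests * (1.0 - redundancy))


# Hypothetical library: 1000 ESTs assembled into 300 singletons and 100 contigs.
example_redundancy = est_redundancy(300, 100, 1000)

# Values quoted in the text for M. persicae (27 686 ESTs, redundancy 0.657).
mp_unique = implied_unique_transcripts(27686, 0.657)
```

Under this reading, the reported redundancy for M. persicae would correspond to roughly 9500 unique transcripts before contaminant filtering.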
Identification of coding and non-coding sequences
More than 99% of unique transcripts were retained after filtration of mitochondrial and Buchnera contaminants. We predicted a CDS in the majority of sequences with a hit in Uniprot, but this failed in up to 25% of these sequences (for T. citricida). This concerned essentially short sequences (<1000 bp), containing a short coding region or with several frameshifts. For sequences with no hit in Uniprot, the prediction program could generate CDSs, but at a lower frequency (Table 1). Overall, we could reconstruct 1183, 2287 and 6652 different CDSs for T. citricida, Aph. gossypii and M. persicae, respectively.
Table 1. Numbers of unique transcripts and predicted coding sequences obtained for three species of aphids
Columns: Hit in Uniprot; No hit in Uniprot. Rows: # unique transcripts.
Reciprocal best hits detection
We detected 4649 reciprocal best hits (RBH) between A. pisum and M. persicae, 1789 between A. pisum and Aph. gossypii, 982 between A. pisum and T. citricida (Table 2). In addition, 259 genes were identified as RBH among the four species, A. pisum, M. persicae, T. citricida and A. gossypii.
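As an illustration of the reciprocal best hit criterion, the following minimal sketch pairs sequences from two gene sets when each is the other's top-scoring hit. The tuple layout, function names and sequence identifiers are hypothetical, and the exact criteria used in this study (described in the Experimental procedures) may include additional filters.

```python
from typing import Dict, Iterable, List, Tuple

Hit = Tuple[str, str, float]  # (query id, subject id, bit score)


def best_hits(hits: Iterable[Hit]) -> Dict[str, str]:
    """Keep, for each query, the subject with the highest score."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b: Iterable[Hit],
                         b_vs_a: Iterable[Hit]) -> List[Tuple[str, str]]:
    """Return pairs (a, b) where b is a's best hit and a is b's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical search results between two gene sets ("ap" and "mp" ids are made up).
ap_vs_mp = [("ap1", "mp1", 200.0), ("ap1", "mp2", 150.0),
            ("ap2", "mp2", 90.0), ("ap3", "mp3", 80.0)]
mp_vs_ap = [("mp1", "ap1", 210.0), ("mp2", "ap1", 140.0),
            ("mp2", "ap2", 145.0), ("mp3", "ap4", 60.0)]

rbh_pairs = reciprocal_best_hits(ap_vs_mp, mp_vs_ap)
```

In this toy example, ap3 is not retained because its best hit, mp3, points back to a different sequence.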
Table 2. Numbers of reciprocal best hits between pairs of species
Most RBH genes (in pairwise analyses or shared by the four species) are sequences that have a hit in Uniprot. Interestingly, however, in the set of 259 four-species RBH genes, there were 20 genes with no hit in Uniprot (7.7%). These genes are thus probably unique to aphids as a group. The mean CDS lengths for the four-species RBH were A. pisum = 800, M. persicae = 667, Aph. gossypii = 572, T. citricida = 482. For T. citricida, which has the lowest number of ESTs, reconstructed genes tended more often to be partial sequences and thus shorter.
Mean distances for reciprocal best hits shared by the four aphid species
We considered two levels of species divergence to examine the distribution of genetic distances in the sets of RBH: within tribes (Macrosiphini or Aphidini) and between tribes (Macrosiphini vs. Aphidini). In the former case, we thus examined all distances between A. pisum and M. persicae (Macrosiphini) and all distances between T. citricida and A. gossypii (Aphidini). In the latter case, we compiled all distances between any of the two Macrosiphini and any of the two Aphidini (combining four pairwise comparisons). The distributions of synonymous distances are shown in Fig. 1. Within each tribe, Macrosiphini or Aphidini, there was a small category of outlier sequences (4 and 3, respectively) with a very high value of dS. As these RBH may not correspond to orthologues (see Discussion) we considered a threshold of dS = 0.6 inferred from the distribution curve, which appears to be a likely upper limit for orthologues in this group. In all subsequent analyses, only RBH genes with dS < 0.6 within Macrosiphini or Aphidini were retained.
For the T. citricida/A. gossypii RBH, the median of synonymous distances (Table 3) was 0.112 while for M. persicae/A. pisum, it was 0.221, showing that the two Macrosiphini are more distant from each other than the two Aphidini. As expected, the median of distances between Macrosiphini and Aphidini RBH was higher, between 0.339 and 0.397; however, most values were well below 1, which suggests that the sequences are below saturation even at the deepest level of divergence considered in this study. By contrast, the medians of the non-synonymous to synonymous ratio (dN/dS) were almost identical for all pairwise comparisons (ranging between 0.037 and 0.058), reflecting a similar level of evolutionary constraint for this set of genes in the different species.
Table 3. Estimated pairwise evolutionary rates for a core set of 253 genes found in the four aphid species and meeting all reciprocal best hit criteria. Median values of the non-synonymous (dN) and synonymous (dS) divergence rates, and of the non-synonymous to synonymous (dN/dS) ratio (standard deviation in brackets)
Acyrthosiphon pisum/ Myzus persicae
T. citricida/ A. gossypii
A. pisum/A. gossypii
A. pisum/T. citricida
M. persicae/A. gossypii
M. persicae/T. citricida
We measured nucleotide compositions at the different codon positions for the 253 putative orthologous sequences present in the four species (Table 4). Compositional patterns in the three other aphid species were similar to those of A. pisum. Nucleotide frequencies were almost identical at the first and second positions (which are the most conserved). However, a slight difference appeared between species at the third codon positions (P-value < 10^-18, Student's t-test): T. citricida and A. gossypii were slightly more AT-rich at the third position than their A. pisum tentative orthologues.
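The position-specific composition can be computed as in the following minimal sketch. The function name and the toy sequence are hypothetical; the sketch reports plain GC% at each codon position and does not implement GC3s, which additionally restricts the third position to synonymous codons.

```python
from typing import Tuple


def gc_by_codon_position(cds: str) -> Tuple[float, float, float]:
    """Return GC% at the first, second and third codon positions of an in-frame CDS.

    Trailing bases that do not form a complete codon are ignored, as are
    ambiguous bases (anything other than A, C, G, T).
    """
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3  # drop an incomplete final codon
    percentages = []
    for pos in range(3):
        bases = [b for b in seq[pos:usable:3] if b in "ACGT"]
        gc = sum(1 for b in bases if b in "GC")
        percentages.append(100.0 * gc / len(bases) if bases else float("nan"))
    return percentages[0], percentages[1], percentages[2]


# Toy CDS with four codons: ATG GCC AAA TTT.
gc1, gc2, gc3 = gc_by_codon_position("ATGGCCAAATTT")
```

Averaging these values over the orthologues of each species gives per-species means of the kind reported in Table 4.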
Table 4. Means of GC% at the first, second, and third codon positions, and at the third synonymous codon positions (GC3s), computed for a set of 253 putative orthologues shared by the four species
Stars show the significance of the difference with A. pisum (P-value from Student's t-test): *P < 0.01; **P < 0.05.
Distances for reciprocal best hits among Acyrthosiphon pisum and another aphid species
We here detail all pairwise comparisons involving the A. pisum official gene set and EST-based partial genomes obtained in another aphid species. For A. pisum/M. persicae, we therefore studied 4334 RBH sequences retained after use of the threshold determined above. The median dS was 0.267 (Table 5), slightly above that of the set of four-species RBH (dS = 0.220, Table 3; P-value = 10^-10, Z-test). The difference was more pronounced for the median non-synonymous distance and dN/dS ratios, which were twice as high in the larger set. The smaller distances in the core gene set are explained by the fact that they correspond to genes found in all EST banks from all aphid species. These typically correspond to genes that are expressed at high levels, ubiquitous, and more conserved than the average. Consistent with this explanation, the effect is weak at synonymous positions but stronger at non-synonymous positions.
Table 5. Estimated pairwise evolutionary rates (as in Table 3) for reciprocal best hit genes computed in three inter-specific comparisons
Median values, with standard deviation between brackets. Stars show the significance of the difference with values from the core set of 253 reciprocal best hit (RBH) genes shared by four species (P-value from a Z-test): *P < 0.01; **P < 0.05.
The distributions of dN/dS ratios (Fig. 2) were similar for the three pairwise comparisons, A. pisum/M. persicae, A. pisum/A. gossypii and A. pisum/T. citricida: all distributions were L-shaped, with a low mode and a long right tail corresponding to RBH with an atypically high ratio. We focused on the sequences with the highest ratios in all comparisons, as they might comprise fast-evolving genes of particular interest. We found 248, 60 and 32 genes for which dN/dS > 0.4 for the three pairwise comparisons, respectively, and then discarded, respectively, 10, 2 and 1 pairs of sequences with an alignment length <150 bp. We recorded A. pisum sequences that had a tentative paralogue (see Experimental procedures) and found 10, 1 and 5 in the three pairwise comparisons.
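The selection of fast-evolving pairs can be sketched as a simple filter on the ratio and on the alignment length. The record layout, function name and example values are hypothetical; only the two thresholds (dN/dS > 0.4, alignment of at least 150 bp) come from the text.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class PairEstimate:
    gene_id: str      # identifier of the A. pisum member of the RBH pair
    dn: float         # non-synonymous distance
    ds: float         # synonymous distance
    aligned_bp: int   # length of the pairwise codon alignment


def fast_evolving(pairs: Iterable[PairEstimate],
                  ratio_min: float = 0.4,
                  min_aligned_bp: int = 150) -> List[PairEstimate]:
    """Keep pairs with dN/dS above ratio_min and a sufficiently long alignment."""
    kept = []
    for pair in pairs:
        if pair.ds <= 0:
            continue  # ratio undefined when dS is zero
        if pair.aligned_bp < min_aligned_bp:
            continue  # short alignments give unreliable estimates
        if pair.dn / pair.ds > ratio_min:
            kept.append(pair)
    return kept


# Hypothetical estimates for four RBH pairs.
example_pairs = [
    PairEstimate("g1", dn=0.02, ds=0.25, aligned_bp=600),  # ratio 0.08: conserved
    PairEstimate("g2", dn=0.15, ds=0.25, aligned_bp=450),  # ratio 0.60: retained
    PairEstimate("g3", dn=0.20, ds=0.30, aligned_bp=120),  # high ratio, too short
    PairEstimate("g4", dn=0.05, ds=0.00, aligned_bp=300),  # dS = 0: skipped
]
fast_ids = [p.gene_id for p in fast_evolving(example_pairs)]
```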
We also recorded all sequences that were RBH in the three pairwise species comparisons and had no Uniprot hit (tentative aphid-specific genes), comprising around 10% of all pairwise RBH. We recorded 445, 159, and 66 such genes for the three comparisons, respectively (Table 6), then discarded, respectively, 2, 1 and 2 pairs of sequences with an alignment length <150 bp. The synonymous distances in the ‘no-hit’ subset were almost identical to those of the set of all RBH at the pairwise level, although the difference was significant (P-value < 0.05, Z-test), whereas the dN and the dN/dS ratios were markedly and significantly higher (P-value < 10^-3, Z-test) than for the rest of the genes. For example, 115, 31 and 15 of the ‘no hit’ genes had dN/dS > 0.40 in the three pairwise comparisons; so about half of the genes with dN/dS > 0.40 were ‘no hit’ sequences, showing a strong association between elevated rates within aphids and lack of detected similarity in Uniprot. Also, tentative paralogues were detected in A. pisum (22, 7 and 4, respectively).
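The association between the ‘no hit’ and fast-evolving categories can be illustrated with a one-sided Fisher's exact test on a 2 × 2 table. This is a sketch only: the study itself reports Z-tests on the rates, the function name is hypothetical, and combining the counts quoted in the text for A. pisum/M. persicae (before the <150 bp filter) into a single table is an assumption.

```python
from fractions import Fraction
from math import comb


def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact test for the table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null of no association.
    """
    row1 = a + b
    col1 = a + c
    n = a + b + c + d
    k_max = min(row1, col1)
    favourable = sum(comb(row1, k) * comb(n - row1, col1 - k)
                     for k in range(a, k_max + 1))
    return float(Fraction(favourable, comb(n, col1)))


# Counts quoted in the text (assumed here to refer to the same 4334 RBH pairs).
n_rbh, n_no_hit, n_fast, n_both = 4334, 445, 248, 115

a = n_both                   # no hit, dN/dS > 0.40
b = n_no_hit - n_both        # no hit, dN/dS <= 0.40
c = n_fast - n_both          # hit, dN/dS > 0.40
d = n_rbh - n_no_hit - c     # hit, dN/dS <= 0.40

p_value = fisher_exact_greater(a, b, c, d)
share_no_hit_among_fast = n_both / n_fast
```

With these counts, ‘no hit’ genes represent about 10% of the RBH pairs but close to half of the fast-evolving ones, so the resulting P-value is extremely small.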
Table 6. Estimated pairwise evolutionary rates (as in Table 3) for genes with no hit in Uniprot, in three inter-specific comparisons
Median of evolution rates in this group
Number of pairs with dN/dS > 0.4
Median values, with standard deviation between brackets. Stars show the significance of the difference with values from all reciprocal best hit genes, presented in Table 5 (P-value from a Z-test): *P < 0.01; **P < 0.05. dN, non-synonymous; dS, synonymous.
We compiled the 5139 A. pisum sequences found in an RBH pair with either M. persicae, T. citricida or A. gossypii. Of these, 3141 sequences could be annotated through Blast2GO with 26 138 gene ontology (GO) terms. We performed a separate analysis for A. pisum and each of the three other aphid species. We found an annotation for close to 60% of A. pisum sequences. However, when we analysed separately the ‘fast-evolving’ (dN/dS > 0.40) or ‘no-hit’ subsets of genes, only 30% and 7% of A. pisum sequences were annotated, respectively (Table 7). The sets of annotated sequences in the ‘no hit’ group were too small to make statistical comparisons. For ‘fast-evolving’ A. pisum sequences in the A. pisum/A. gossypii and A. pisum/T. citricida comparisons, we met the same problem (the sample size was too small). For the A. pisum/M. persicae analysis, however, we found significant differences in the frequencies of GO categories between the ‘fast-evolving’ subset of sequences and the rest of the genes. For example, genes involved in nucleotide/nucleoside binding or in metabolic process (Table 8a) were under-represented (P-value < 0.01, Fisher's exact test) in the ‘fast-evolving’ subset. This result was expected because some genes in that category are typically conserved proteins (e.g. genes involved in translation). More interestingly, 22 GO terms, associated with 33 genes, were over-represented in the ‘fast-evolving’ subset (P-value < 0.01, Fisher's exact test) (Table 8b). We note, however, that the false discovery rate (FDR) was over 0.05, perhaps because of the small number of sequences for each GO term. For example, ACYPI006820 (‘cactus’) and ACYPI001858 (‘spaetzle-1’) are involved in development and innate immunity (anti-microbial and anti-fungal response), in the Toll-signalling pathway (Gerardo et al., 2009). The non-synonymous to synonymous ratios are, respectively, 0.50 and 0.44 for the two genes, far above the average (0.05).
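The gap between a raw P-value below 0.01 and an FDR above 0.05 can be illustrated with a multiple-testing correction. The text does not state which FDR procedure was applied; the sketch below assumes the Benjamini-Hochberg step-up adjustment, and the P-values are hypothetical.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted P-values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw P-values for four GO terms.
raw = [0.005, 0.009, 0.04, 0.5]
adjusted = benjamini_hochberg(raw)

# One term at P = 0.008 among 200 otherwise unremarkable terms (hypothetical).
many_tests = [0.008] + [0.5] * 199
adjusted_single = benjamini_hochberg(many_tests)[0]
```

When several hundred GO terms are tested and only a few genes carry each term, a term with a raw P-value just under 0.01 can easily end up with an adjusted value above 0.05, as in the second example.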
Table 7. Number of annotated genes (identification of a GO term) among reciprocal best hits (RBH) computed in three inter-specific comparisons, for all genes or for two subgroups of the genes (‘fast-evolving’, i.e. RBH with dN/dS > 0.40, and ‘no hit’)
Acyrthosiphon pisum/ Myzus persicae
A. pisum / Aphis gossypii
A. pisum/T. citricida
RBH with no hit in Uniprot
Table 8. Annotation of ‘fast-evolving’ genes (reciprocal best hits between Acyrthosiphon pisum and Myzus persicae with dN/dS > 0.40)
(a) Gene Ontology term under-represented in the fast-evolving gene set
GO slim term
(b) Gene names (in A. pisum) and associated gene ontology terms that are over-represented in the ‘fast-evolving’ gene set
A detailed study of some genes showing rate acceleration
We selected four example genes that show some of the highest values of the dN/dS ratio in pairwise comparisons and for which we identified putative orthologues in at least three aphid species.
Protein C002. (dN/dS = 0.57 between A. pisum and M. persicae); this gene, named ‘protein C002’ (no hit in Uniprot), has recently been identified as being specific to salivary glands and essential in feeding (Mutti et al., 2008). We obtained predicted full-length CDSs of similar sequences from ESTs in M. persicae and A. gossypii, while a partial CDS was obtained from T. citricida. No paralogue was found in the A. pisum genome. A phylogenetic analysis with different methods (NJ, ML) showed that the gene tree matched the known species tree for the four species, suggesting that the genes are indeed orthologues (result not shown). The global dN/dS ratio (one-ratio model from PAML) was exceptionally high, at 0.73. Alternative models supposing different ratios for the branches or different ratios among sites did not fit significantly better.
JHAMT. (dN/dS = 0.60 between A. pisum and M. persicae): this gene (A. pisum-JHAMT3 in Fig. 3) is similar to Juvenile Hormone acid methyltransferase (JHAMT). JHAMT has been shown to play an important role in insect metamorphosis (Shinoda & Itoyama, 2003). This gene has been recently studied by Miura et al. (unpublished), who found that it belongs to a gene family comprising four copies in the A. pisum genome. The A. pisum gene models were checked in AphidBase (Legeai et al., 2010) (http://www.aphidbase.com) for each of the four copies, and corrected in one case (for A. pisum-JHAMT3). The high level of bootstrap support (85%) at a node comprising A. pisum-JHAMT1, A. pisum-JHAMT2 and A. pisum-JHAMT3 suggests that these three copies arose by two consecutive lineage-specific duplications in an ancestor of A. pisum. They would therefore be co-orthologues of the gene obtained for M. persicae. By contrast, the more distant A. pisum-JHAMT4 was grouped with a gene obtained from T. citricida. These two copies probably correspond to a more ancient duplication of the gene. We evaluated different models for the ratio of non-synonymous to synonymous divergence: the free-ratio model, which assumes a different ratio on each branch, fitted the data significantly better than the one-ratio model (Table 9). However, site-models and branch-site models (where the group JHAMT1-3 was retained as foreground and compared to the rest) were not significant, i.e. no precise site appeared to be under positive selection.
Table 9. Branch specific tests for variable ratios of non-synonymous to synonymous divergence rates, from Codeml (PAML, Yang, 1997), in four ‘fast-evolving’ genes or gene families identified in at least three aphid species: protein C002, Juvenile Hormone acid methyl transferase-like, Apo (see text), Glycosyl-hydrolase-like
Log-likelihood, number of degrees of freedom and P-value of the LRT between a one-ratio and a free-ratio model: *P < 0.05; **P < 0.01.
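The likelihood ratio test (LRT) between nested models compares twice the difference in log-likelihood with a chi-square distribution whose degrees of freedom equal the difference in the number of free parameters. The sketch below implements this with the standard library only; the function names and the log-likelihood values are hypothetical and are not taken from Table 9. The example uses four degrees of freedom, which is what one would expect for a free-ratio versus one-ratio comparison on an unrooted four-taxon tree (five branch ratios instead of one), under that assumption.

```python
import math
from typing import Tuple


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square distribution with integer df."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half_x = x / 2.0
    # Start from Q(1, x/2) for even df or Q(1/2, x/2) for odd df ...
    if df % 2 == 0:
        a, q = 1.0, math.exp(-half_x)
    else:
        a, q = 0.5, math.erfc(math.sqrt(half_x))
    # ... then apply Q(a + 1, y) = Q(a, y) + y^a e^(-y) / Gamma(a + 1).
    while a < df / 2.0:
        q += math.exp(a * math.log(half_x) - half_x - math.lgamma(a + 1.0))
        a += 1.0
    return min(q, 1.0)


def likelihood_ratio_test(lnl_null: float, lnl_alt: float,
                          df: int) -> Tuple[float, float]:
    """Return (2 * delta lnL, P-value) for two nested models."""
    statistic = 2.0 * (lnl_alt - lnl_null)
    return statistic, chi2_sf(max(statistic, 0.0), df)


# Hypothetical log-likelihoods for a one-ratio (null) and a free-ratio model.
statistic, p_value = likelihood_ratio_test(lnl_null=-2510.4, lnl_alt=-2498.1, df=4)
```

A small P-value indicates that allowing a separate dN/dS ratio on each branch improves the fit significantly; it does not by itself identify which branches or sites are under positive selection.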
No-hit sequence, ‘Apo’. (dN/dS = 0.26 for A. pisum–M. persicae, dN/dS = 0.42 for A. pisum–T. citricida; A. pisum-Apo2 in Fig. 4). This gene was ‘no hit’ against Uniprot but a search against InterPro resulted in a low-score similarity to ‘apolipoprotein’ (we therefore named the gene ‘Apo’). We mined all predicted peptides (A. pisum reference gene set) and genomic and transcript sequence data to identify related gene copies in A. pisum. We found five predicted peptides in the A. pisum reference gene set, but only two of them appeared to be full-length, each comprising four exons (ACYPI008569, or A. pisum-Apo2, and ACYPI52923, or A. pisum-Apo3 in Fig. 4; we corrected the initially predicted model for ACYPI52923). A tblastn search of the protein sequence of A. pisum-Apo2 against genomic traces showed many matches (with a depth of about 30–35 traces all over the sequence), while the sequencing coverage was only 6X. This suggests that there are five to six copies of the gene. Finally, a blast analysis against A. pisum ESTs led us to identify a complete EST-based CDS similar (97% nucleotide identity), but not identical, to A. pisum-Apo2.
The EST-based copy (A. pisum-Apo1) has genomic support in two different scaffolds (for the 5’ and 3’ ends, respectively) and in unassembled reads for the rest of the CDS. A fourth copy (ACYPI54280) was predicted as partial (A. pisum-Apo4). Finally two more predicted peptides (ACYPI36683 and ACYPI33174) were partial CDSs containing several stop codons (these are probable pseudogenes, which were not used).
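The copy-number estimate from trace depth is simple arithmetic: the number of matching traces per position divided by the average sequencing coverage. The function name is hypothetical, and the estimate assumes that traces from all copies match the query equally well and that coverage is uniform.

```python
def estimated_copy_number(trace_depth: float, genome_coverage: float) -> float:
    """Approximate number of gene copies from trace depth and sequencing coverage."""
    if genome_coverage <= 0:
        raise ValueError("genome_coverage must be positive")
    return trace_depth / genome_coverage


# Values quoted in the text: 30 to 35 matching traces at 6X coverage.
low_estimate = estimated_copy_number(30, 6)
high_estimate = estimated_copy_number(35, 6)
```

The two bounds bracket the "five to six copies" inferred above, which is close to the four copies used in the phylogeny plus the two probable pseudogenes.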
The ML phylogenetic tree (Fig. 4) strongly supported the grouping of the four A. pisum copies (A. pisum-Apo1-4), suggesting they arose as successive lineage-specific duplications, with Apo3 and Apo4 appearing as closely related sequences. This was overall a fast-evolving gene, with high absolute numbers of non-synonymous substitutions on branches (Fig. 4). We also found that a free-ratio model fitted the data significantly better than the one-ratio model (Table 9), due to striking increases of the dN/dS ratio for Apo2 and for the Apo3/Apo4 group, and also on the branch ancestral to A. pisum. Finally, the analysis of a site model and of branch-site models did not identify precise sites under positive selection.
Glycosyl-Hydrolase. (dN/dS = 0.48 for A. pisum-GH1–M. persicae): This gene matched glycosyl-hydrolases from diverse organisms in Uniprot (we named it ‘GH’). We found three copies in the A. pisum reference gene set. A single partial CDS was obtained from ESTs for each other aphid species (A. gossypii, M. persicae and T. citricida). A phylogenetic analysis showed that the A. pisum copies are closely related, consistent with lineage-specific duplications (Fig. 5). The analysis of different models of dN/dS ratios showed that the free-ratio model fitted significantly better than the one-ratio model (Table 9), the different A. pisum branches having strikingly different ratios. The A. pisum-GH1 branch had a low dN/dS ratio (0.17) while a much higher ratio was found for the branch ancestral to A. pisum-GH2 and A. pisum-GH3 (1.45). However, site-models and branch-site models (where the group A. pisum-GH1-3 was retained as foreground and compared to the rest) were again not significant.
Gene reconstruction from expressed sequence tag data and orthology assessment
While EST collections for aphid species other than A. pisum are variable in size, we have been able to reconstruct significant numbers of CDSs in each species (e.g. 6652 CDSs recovered from 27 686 ESTs in M. persicae). Of course, many of these CDSs are partial, or in some cases may correspond to non-overlapping fragments of the same gene. Due to the dissimilar number of sequences in each species, we analysed separately each of the pairwise comparisons between an EST-based gene set (for M. persicae, A. gossypii, and T. citricida) and A. pisum. This probably resulted in different numbers (and categories) of genes considered in each comparison. For that reason, we also analysed a core set of genes present in the four species (genes that matched all RBH criteria for all pairwise comparisons, among the four species); given the stringency of this condition, we expect that the large majority of these genes are indeed orthologues found in all species. This restricted gene set is biased towards genes with a high level of expression (genes found in all EST banks), but it has the advantage of being comparable among the four species.
Specific differences in nucleotide composition within aphids
The aphid genome has shown an unusually low GC composition (34.5% in GC3 for A. pisum) (Rispe et al., 2007; International Aphid Genomics Consortium, 2010), compared with that of all other completely sequenced insect genomes, even the relatively AT-rich honeybee genome. The set of 253 putative orthologues identified in the four aphid species is, by construction, biased towards highly expressed genes and shows a higher %GC3 (37%) than the rest of the genome, in agreement with previous results showing that genes supported by more ESTs had a higher %GC than average in A. pisum (Rispe et al., 2007). Interestingly, we identified a significant difference at the tribe level, since the two species of Aphidini showed a significantly lower GC3 content than the two species of Macrosiphini. This suggests that compositional shifts have occurred along the divergence between the two tribes. More sequence data, in particular from outgroup species (in other aphid sub-families or other families), would be needed to analyse the dynamics of these compositional shifts.
Variable evolutionary rates among putative orthologues
The evolutionary distances we observed are very close to the mean values found in a recent study by Brisson and Nuzhdin (2007) for A. pisum–M. persicae and A. pisum–A. gossypii comparisons (comprising EST-based gene sets for all species, including A. pisum). The distributions of synonymous distances were Gaussian, with a relatively low range, reflecting that this distance can be a rough estimate of divergence time between two sequences. A few of the 259 four-species RBH appeared, however, as outliers, showing very high synonymous distances (e.g. six genes with dS > 0.60 between A. pisum and M. persicae). We checked by a phylogenetic analysis that three of these genes were non-orthologous (the three pairs with the highest estimated dS value). By contrast, genes characterized by a synonymous distance among RBH within a tribe below a certain threshold (dS < 0.60) were found to be true orthologues (results not shown, on phylogenetic analyses of a sample of genes characterized by a dS close to the threshold value).
Functional characterization of fast-evolving genes
The functional annotation of genes in the ‘fast evolving’ category showed relatively few significant differences with ‘standard’ genes, but it was probably limited by the relatively small numbers of genes found in the different GO categories. This is a limitation of partial gene sets reconstructed from ESTs, and we expect that future collections of ESTs will provide more complete gene sets and therefore more statistical power for these comparisons. One category significantly enriched under Fisher's test (but not, however, under the FDR test) is of interest: these are genes annotated as ‘defence response to fungus’. Genes involved in defence and immunity are relatively few in A. pisum overall (Gerardo et al., 2009) and this category comprised only three A. pisum–M. persicae RBH genes. But two of those did show very high pairwise dN/dS ratios. It is thus possible that these genes (cactus and spaetzle1-2), involved in the Toll signalling pathway, are under strong selective pressure in aphids. A phylogenomic analysis available at PhylomeDB (Huerta-Cepas et al., 2008) (http://phylomedb.org/) shows that, while cactus is single-copy in the A. pisum genome, spaetzle-1 has five copies, resulting from serial lineage-specific duplications. These duplications may have promoted increases of non-synonymous substitution rates. Further data including more sequences from other aphid taxa would be necessary to determine if the gene is actually under positive selection.
We found in each of the EST-based gene sets a relatively large fraction of CDSs with no hit in Uniprot (1977 out of 6652 CDSs, or 30% in the M. persicae gene set, and similar frequencies in T. citricida and A. gossypii). These CDSs may correspond to genes unique to aphids as a group, or to individual aphid species. For example, 445 out of 4334 RBH (10%) between A. pisum and M. persicae were ‘no hit to Uniprot’. These CDSs are tentatively unique to aphids; this also suggests that the majority of ‘no hit’ sequences in M. persicae are unique to that species, and thus that M. persicae possesses a significant fraction of genes not found in other aphids.
We tried to identify the function of the ‘no hit’ genes with Blast2GO but only managed to find a GO term for a small fraction of them. We also found, interestingly, that evolutionary rates are significantly higher in the ‘no hit’ category. The absolute difference is small for synonymous rates, but it is high for non-synonymous rates and for the non-synonymous to synonymous ratio. This suggests that the ‘no-hit’ category comprises genes that are evolving particularly fast at the protein level, either because these sequences are under relaxed selection or because they are under positive selection. These high rates of evolution may explain why these genes are only recognized within aphids, as they may have diverged too much from related sequences in other animal groups.
Phylogenetic analyses of some fast-evolving sequences
We have focused on four fast-evolving sequences (high dN/dS ratios) present in A. pisum and at least two other aphid species. In two cases (‘C002’ and ‘Apo’), the genes had no hit in Uniprot. C002 has recently been investigated by Mutti et al. (2008), who found that this gene is involved in the process of feeding (the corresponding protein is transferred into the plant with aphid saliva). This gene, for which we found a single copy in each gene set, showed a very high dN/dS ratio, close to unity over the whole sequence. Although the models of variable ratios among sites were not significant (we could thus not identify specific sites that could be under positive selection, possibly due to a lack of power related to analysing only four sequences), we may interpret this very high rate of evolution as the result of an adaptive response, and possibly as the result of strong host-plant interactions. The knock-down of the C002 gene in the pea aphid resulted in defects in feeding and in foraging for phloem sap (Mutti et al., 2008). The fact that C002 has no homologues in other insects but does have orthologues in all four aphid species studied here strongly suggests a specific adaptation. Of course, further functional analyses (and more detailed phylogenetic data, with sequences from more taxa) would be necessary to demonstrate the adaptive aspects of changes of this sequence. In another ‘no hit, fast-evolving’ sequence, ‘Apo’, we also found very high dN/dS ratios, and determined that the ratio increases were related to duplication events. In fact, a relatively low dN/dS ratio characterized most of the gene tree, and one of the A. pisum copies (among four) showed a slow rate. But the three other A. pisum copies displayed much higher dN/dS ratios. A similar situation was found when analysing two other sequences, JHAMT and GH: in both cases, the A. pisum genome showed gene amplifications consistent with lineage-specific duplications.
We found in each case strong increases of the dN/dS ratio in some of the A. pisum copies, sometimes well above unity. This indicates that gene duplication strongly influenced evolutionary rates. While one copy retained a low dN/dS ratio, other copies showed accelerated patterns of evolution. We could not fully determine what governed these ratio increases, as they could result either from relaxation of selection or from positive selection. However, the fact that we estimated ratios well above unity for JHAMT1, Apo2 to Apo4, and GH2/GH3 suggests that these duplicates could have diverged as the result of an adaptive process.
A high level of gene duplication compared with other insect genomes, spread among many gene families, has been shown to be a pervasive feature of the A. pisum genome (International Aphid Genomics Consortium, 2010). This would therefore provide a very large data set to test theoretical predictions on the relation between duplications and evolutionary rates (Ohno, 1970). It is known that cases of positive selection (Hughes, 1994) often occur among gene families, while strong heterogeneities in rates can also be explained by relaxation of selection (on either one or both copies, Lynch & Conery, 2000). The present study was, however, limited by a relatively small number of aphid taxa and a relatively small number of genes shared by at least three aphid species (the only cases for which a phylogenetic analysis can be conducted). Also, the identification of orthologues with the RBH method may have led us to discard some genes found in multiple copies in one or two species (although we argue that the presumably complete A. pisum gene set helps to correctly determine orthologues in most cases). We expect, however, that future sequence data in different aphid species (even with partial genomic sequencing, or larger EST-based gene sets) will be critical to fully evaluate the dynamics of the genome in that group of insects. Such data will allow a finer-scale study of the unusually high level of duplications (with dating of the different events), and a study of the influence of duplications on evolutionary rates. It is intriguing to find a high level of gene duplication in two arthropods (the pea aphid and Daphnia pulex) both characterized by high phenotypic plasticity (Gilbert, 2009). Although the mechanisms responsible for a high level of duplication or higher survival rates of gene copies may be different in these groups, it may be proposed that these organisms share a similar selective pressure in favour of keeping a large array of gene copies.
Finally, new extensive genomic data for other aphid species will allow fast-evolving sequences to be detected without a priori assumptions (whether or not these genes belong to gene families), and will help determine whether these genes show signs of adaptive evolution. This, in addition to functional characterization of the targeted genes, should provide a better understanding of the unusual biological adaptations and life-cycle traits of aphids.
Acyrthosiphon pisum complete genome, expressed sequence tag collections and assembly
The complete genome of the pea aphid has recently been sequenced and given a preliminary annotation: the resulting reference set of protein-coding sequences comprises 34 603 genes (Official Gene Set; International Aphid Genomics Consortium, 2010). For the other species, collections of short transcript sequences or ESTs were obtained from GenBank for M. persicae, A. gossypii and T. citricida and assembled using Tgicl (Pertea et al., 2003).
Reconstruction of protein coding genes from expressed sequence tag sequences
For each species, the unique transcripts were analysed through blastx (Altschul et al., 1990) against Uniprot (cut-off e-value of 10e−10). This information helped to identify potential homology and was used in the CDS detection process. The unique transcripts were also compared (blastx) to the Buchnera aphidicola genome (endosymbiont of aphids) using the A. pisum associated strain (Tamas et al., 2002) and to the mitochondrial genome of the aphid Schizaphis graminum. We then ran FrameD (Schiex et al., 2003), a program that uses both blastx results (if any) and a training matrix (based on a large sample of A. pisum genes) to predict CDSs from unique transcripts. Frameshifts, a significant risk with ESTs (single-pass, error-prone sequences), were detected and corrected. We retained only CDSs of at least 150 bp. Unique transcripts (contigs) and EST-based CDSs were made available for download at http://www.aphidbase.com/aphidbase/downloads.
Detection of putative orthologous sequences
To identify putative orthologues between A. pisum (complete genome) and M. persicae, A. gossypii and T. citricida, we used the RBH method (Hirsh & Fraser, 2001; Jordan et al., 2002). Although the RBH method does not use a phylogenetic approach, it has been shown that its results are largely equivalent to those of phylogenetic methods (Moreno-Hagelsieb & Latimer, 2007). A strong concern is that the EST-based reconstructed coding genomes are partial sets of coding sequences, with many missing genes. For gene families, this may lead to the pairing of genes that are not orthologues, which is a significant issue given that the pea aphid genome has been shown to comprise a high percentage of duplicated genes (International Aphid Genomics Consortium, 2010). However, using a complete genome in pairwise comparisons reduces this risk, as shown in Fig. 6. This figure represents a theoretical example of a gene family with lineage-specific duplications in A. pisum and another aphid species (species B), but also with a more ancient duplication. An incomplete data set in species B (e.g. not comprising gene B1) would still allow the correct assessment of orthology for the other copies of that species (either B2 or B3 should be identified as an RBH with A. pisum-3).
Because the annotation of A. pisum may remain incomplete (e.g. if A. pisum-3 was not annotated), some RBHs may not be orthologues. A threshold synonymous distance was applied to reduce this risk (see Results). In addition to pairwise analyses, we searched for potential orthologues present in the four species studied and satisfying all RBH criteria. The multiple constraints needed to satisfy all RBH relationships among four species increase the chance that these genes represent a core set of orthologues present in all species (Fig. 7).
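The RBH criterion described above can be sketched as follows. This is a minimal illustration rather than the pipeline used in this study: it assumes that the blast results have already been reduced to (query, subject, bit score) tuples, and the function names and the toy gene identifiers are hypothetical.

```python
def best_hits(hits):
    """Return {query: subject}, keeping the highest-scoring subject per query.

    `hits` is an iterable of (query, subject, bit_score) tuples, for example
    parsed from tabular blast output (an assumed input format).
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(hits_a_vs_b, hits_b_vs_a):
    """Pairs (a, b) such that b is the best hit of a and a is the best hit of b."""
    a_to_b = best_hits(hits_a_vs_b)
    b_to_a = best_hits(hits_b_vs_a)
    return sorted((a, b) for a, b in a_to_b.items() if b_to_a.get(b) == a)


# Toy data loosely following the gene-family example: copy B1 is absent
# from the partial gene set of species B (all identifiers are made up).
a_vs_b = [("Ap1", "B2", 180.0), ("Ap2", "B2", 175.0),
          ("Ap3", "B2", 410.0), ("Ap3", "B3", 395.0)]
b_vs_a = [("B2", "Ap3", 410.0), ("B2", "Ap1", 180.0), ("B3", "Ap3", 395.0)]
pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

In this toy case, Ap1 and Ap2 both point to B2, but B2 points back to Ap3 only, so they are not retained. A synonymous-distance threshold such as the one mentioned above would be applied to the retained pairs as a separate filter.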
For all RBH pairs, translated sequences were aligned using T-coffee (Notredame et al., 2000), and the nucleotide sequences were then aligned using the protein alignment as a guide. Alignments were trimmed using Gblocks (Castresana, 2000) to retain only the parts that aligned well. We then estimated maximum likelihood (ML) pairwise synonymous (dS) and non-synonymous (dN) evolutionary rates, using a codon-based model (Codeml from PAML; Yang, 1997). The ratio of non-synonymous to synonymous substitutions (dN/dS) is commonly used as an indicator of variable evolutionary pressures among protein-coding genes: low ratios are typical of highly constrained sequences (under purifying selection), while values close to unity would reflect relaxed selection and values above unity would result from positive selection. We aimed to determine the genes with the highest ratios as potential targets of positive or diversifying selection.
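Using a protein alignment as a guide for the nucleotide alignment amounts to threading each coding sequence, codon by codon, onto its gapped protein row. The sketch below illustrates the idea only; it is not the tool used in this study, the function name `thread_codons` is hypothetical, and it assumes that each CDS is in frame and matches its translated sequence residue for residue.

```python
def thread_codons(aligned_protein, cds, gap="-"):
    """Build a codon-alignment row from a gapped protein row and its ungapped CDS.

    Each residue consumes the next codon of `cds`; each protein gap becomes
    three nucleotide gaps, so the reading frame is preserved.
    """
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    if n_residues > len(codons):
        raise ValueError("CDS is too short for the aligned protein")
    row, next_codon = [], 0
    for aa in aligned_protein:
        if aa == gap:
            row.append(gap * 3)
        else:
            row.append(codons[next_codon])
            next_codon += 1
    return "".join(row)


# Two made-up sequences whose protein alignment contains one gap.
row_a = thread_codons("MK-L", "ATGAAACTG")
row_b = thread_codons("MKSL", "ATGAAGTCTCTG")
print(row_a)
print(row_b)
```

Both rows have the same length and every column block of three nucleotides corresponds to one aligned residue, which is the form expected by codon-based models.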
Detection of Acyrthosiphon pisum paralogous sequences
Since global analyses of the A. pisum genome have shown a high level of duplication, it is possible that genes identified as RBH may have closely related copies in A. pisum. Since these duplication events may influence evolutionary rates, we detected potential paralogues by performing a blastp (e-value cut-off = 10e-10) of the A. pisum gene reference set against itself. For each A. pisum sequence that was RBH with a CDS from another aphid species, we considered other related copies in A. pisum to be likely paralogues if their distance to the A. pisum target gene was lower than the distance between the two RBH sequences.
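The paralogue criterion above reduces to a simple comparison of distances. The following sketch assumes that pairwise distances (for instance synonymous distances, which is an assumption here) have already been computed; the function name and the gene identifiers are hypothetical.

```python
def likely_paralogues(target, rbh_distance, distances_to_target):
    """Return the A. pisum copies that are closer to `target` than its RBH partner.

    `rbh_distance` is the distance between `target` and its RBH in the other
    species; `distances_to_target` maps each related A. pisum gene (from the
    self-blastp hits) to its distance from `target`.
    """
    return sorted(gene for gene, dist in distances_to_target.items()
                  if gene != target and dist < rbh_distance)


# Made-up distances: two copies are closer to the target than the RBH partner.
related = {"Ap_g2": 0.15, "Ap_g3": 0.38, "Ap_g4": 0.90}
paralogues = likely_paralogues("Ap_g1", 0.40, related)
print(paralogues)
```

Copies that are more distant than the RBH partner (here Ap_g4) would correspond to duplications older than the divergence of the two species and are not counted by this criterion.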
Phylogenetic analyses of selected genes
For a sample of genes showing a high dN/dS and available in at least three aphid species, we performed a phylogenetic analysis including all related copies in all species. A ML phylogenetic analysis was performed; parameters of the substitution model were determined by running Modeltest (Posada & Crandall, 1998) through PAUP (Swofford, 2002) to test 56 different models of substitution. Then we used PHYML (Guindon & Gascuel, 2003), choosing the most similar parameters to those found by Modeltest, and ran 100 bootstrap (Felsenstein, 1985) tests of the model.
We then estimated the likelihood of a one-ratio model and of models with ratios varying along all branches (free-ratio model) or among sites (PAML 3.15, Yang, 1997). The different models were compared using the likelihood ratio test (LRT), which compares twice the difference in log likelihood to a chi-square distribution with the required number of degrees of freedom. We used ‘site’ models M7 vs. M8 (Yang et al., 2005) and the ‘branch-site’ models included in test 2 (Zhang et al., 2005). Test 2 compares a model in which the branch under consideration is evolving without constraint (dN/dS = 1) to a model in which this branch has some proportion of sites evolving under positive selection (dN/dS > 1).
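The LRT computation itself is short and can be sketched as below. The log-likelihood values are invented for illustration, the function names are hypothetical, and the number of degrees of freedom must be supplied for each pair of nested models. The chi-square tail probability is computed with the standard library only, using the recurrence Q(a + 1, z) = Q(a, z) + z^a e^(-z) / Γ(a + 1) for the regularized upper incomplete gamma function.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square distribution with integer df >= 1."""
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    z = x / 2.0
    # Closed-form starting points: df = 2 (a = 1) and df = 1 (a = 0.5).
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))
    # Step a up to df / 2 with Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1).
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


def likelihood_ratio_test(lnl_null, lnl_alt, df):
    """Return (statistic, p-value) for two nested models."""
    statistic = max(0.0, 2.0 * (lnl_alt - lnl_null))
    return statistic, chi2_sf(statistic, df)


# Invented log likelihoods for a null and an alternative model, with 2 df.
stat, p_value = likelihood_ratio_test(-1000.0, -995.0, 2)
print(stat, p_value)
```

With these invented values the statistic is 10 and the p-value is exp(-5), about 0.0067, so the null model would be rejected at the 1% level.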
The A. pisum gene reference set and each set of EST-based CDSs in the three other aphid species were compared against Uniprot (blastx, e-value cut-off = 10e-10). We retained the best hit for each sequence as a tentative annotation for each gene. We then used Blast2GO (Conesa et al., 2005) (http://www.blast2go.org) to identify GO terms associated with sequences. We compiled the GO terms for two groups of genes of particular interest: genes that were characterized by an elevated dN/dS ratio (fast-evolving aphid sequences) and genes that were shared by aphids but had no hit in Uniprot (but for which a GO term was identified). For each pairwise comparison, we compared both groups of genes with the reference group of genes containing all A. pisum RBHs except the target set of sequences. Fisher's exact tests, applying a robust FDR correction, were performed through Blast2GO using Gossip (Blüthgen et al., 2005) in order to highlight annotation differences between gene sets (to detect whether a category of genes was enriched in some GO terms).
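The enrichment test described above can be illustrated with a one-sided Fisher's exact test on a 2 × 2 table (target set vs. reference set, with vs. without the GO term), followed by a multiple-testing correction. The sketch below is not the Blast2GO/Gossip implementation: the Benjamini-Hochberg procedure is used here as a simple stand-in for the robust FDR correction, and the function names and counts are hypothetical.

```python
from math import comb


def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher's exact p-value, P(X >= k).

    N: genes in target plus reference sets; K: genes carrying the GO term;
    n: genes in the target set; k: target genes carrying the GO term.
    """
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up counts: 3 target genes, all carrying a term found in 3 of 6 genes.
p_term = fisher_enrichment_p(3, 3, 3, 6)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03])
print(p_term, adjusted)
```

In practice one p-value would be computed per GO term and the whole list would be corrected together.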
Analysis of GC rates
The establishment of a core set of potential orthologues shared by all four aphid species provided an opportunity to compare genomic traits such as codon usage and nucleotide composition (which ideally should be compared on orthologous genes). GC percentages at the different codon positions were computed for the set of genes identified in the four species and matching all RBH criteria among any pair of species.
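GC percentages at the three codon positions can be computed as sketched below. This is a minimal illustration assuming an in-frame coding sequence; the function name is hypothetical, and alignment gaps or ambiguous bases are simply skipped.

```python
def gc_by_codon_position(cds):
    """Return (GC1, GC2, GC3) percentages for an in-frame coding sequence.

    Characters other than A, C, G and T (gaps, N, etc.) are ignored, and
    trailing bases that do not form a complete codon are dropped.
    """
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3
    gc = [0, 0, 0]
    total = [0, 0, 0]
    for i in range(usable):
        base = seq[i]
        if base in "ACGT":
            total[i % 3] += 1
            if base in "GC":
                gc[i % 3] += 1
    return tuple(100.0 * g / t if t else float("nan") for g, t in zip(gc, total))


# Made-up sequence with codons ATG, GCC and AAG.
gc1, gc2, gc3 = gc_by_codon_position("ATGGCCAAG")
print(gc1, gc2, gc3)
```

For a gene set, the counts would typically be pooled over all genes of a species (or averaged per gene) before comparing species or tribes.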
The authors would like to acknowledge Région Bretagne and INRA for funding, Jean-Pierre Gauthier for bioinformatic support and Denis Tagu for comments on this draft.