AphidBase: a centralized bioinformatic resource for annotation of the pea aphid genome

Authors


Fabrice Legeai, INRIA –équipe Symbiose, Campus de Beaulieu, 35042 Rennes, cedex, France. Tel.: +33 (0) 299 8471 18; fax: +33 (0) 299 8471 71; e-mail: fabrice.legeai@rennes.inra.fr

Abstract

AphidBase is a centralized bioinformatic resource that was developed to facilitate community annotation of the pea aphid genome by the International Aphid Genomics Consortium (IAGC). The AphidBase Information System, designed to organize and distribute genomic data and annotations for a large international community, was constructed using open source software tools from the Generic Model Organism Database (GMOD) project. The system includes the Apollo and GBrowse utilities as well as a wiki, BLAST search capabilities and a full text search engine. AphidBase strongly supported community cooperation and coordination in the curation of gene models during community annotation of the pea aphid genome. AphidBase can be accessed at http://www.aphidbase.com.

Introduction

A high quality genome sequence is a prerequisite for whole genome analyses, but robust and complete annotations are also essential for a genome to be fully utilized by the scientific community. Genome annotation involves mapping features such as protein coding genes and their multiple mRNAs, pseudogenes, transposons, repeats, non-coding RNAs and SNPs, as well as regions of similarity to other genomes, onto the genomic scaffolds. Many of these features can be automatically predicted by sophisticated software packages based on sequence or structure comparisons.

The identification of protein-coding genes is widely considered to be critical to understanding the biology of an organism (Stein, 2001). Gene prediction programs identify protein-coding genes using either ab initio prediction, evidence-based prediction, or a combination of the two methods. Evidence-based prediction programs such as Augustus (Stanke et al., 2006), Fgenesh++ (unpubl., http://www.softberry.com) or the NCBI RefSeq pipeline are generally considered most reliable because they are better able to characterize untranslated regions (UTRs) and alternative splicing especially when cDNA sequences representing full-length mRNAs or large numbers of ESTs are available (Brent, 2008). However, in many cases gene predictions require evaluation by specialist biocurators of a gene family or a pathway.

One of the most challenging aspects of dispersed community annotation is the need to maintain consistent data formats and to minimize the potential for duplicated annotation made simultaneously by two different annotators (Elsik et al., 2006). This requires annotation tools, standardized methods, oversight by expert curators and a combination of social infrastructure, tool development, training and feedback (Howe et al., 2008). The main mistake made during manual annotation in previous genome projects was allowing the submission of incomplete and internally inconsistent data. This resulted in annotated genes with missing co-ordinates, protein and mRNA sequences that did not match, and a number of other issues that pollute the databases with incorrect information. To remedy this problem, VectorBase asked submitters to supply data in a spreadsheet format including gene prediction and gene symbol descriptions (Lawson et al., 2009), while BeeBase developed procedures for handling community-annotated gene models that included mapping, checking for errors and redundancy, assigning identifiers, and incorporating them into the database (Elsik et al., 2006). Here we adopted Apollo (Lewis et al., 2002), a software tool specialized for editing annotations. Apollo provides a graphical, straightforward and controlled approach to manual curation.

The role of the biocurators is not only to inspect and correct automatically predicted gene structures and proteins, but also to add value by connecting information from different sources in a coherent and accessible way (Elsik et al., 2006; Howe et al., 2008). Assembling and curating the datasets generated during annotation of a genome is a labour-intensive and relatively slow process (Wilming et al., 2008) but annotation can be spread over a large number of people to accelerate the process. The efforts of a strong, organized, motivated and voluntary community with many researchers specializing in a variety of gene families of interest, can greatly improve the annotation of a genome sequence; these criteria were met by the International Aphid Genomics Consortium (IAGC), whose goal is to develop genomic resources for aphids. The IAGC recently supervised the sequencing, assembly and analysis of the first aphid genome, that of the pea aphid Acyrthosiphon pisum (International Aphid Genomics Consortium, 2010). Members of the IAGC represent a large community of aphid specialists all over the world collaborating on the analysis of the pea aphid genome. The annotation datasets generated by the IAGC needed an official, centralized repository providing worldwide access, that is now provided by AphidBase.

AphidBase, formerly a web application for the analysis of aphid expressed sequence tags (ESTs; Gauthier et al., 2007), has been upgraded to a comprehensive genome information resource dedicated to aphids. Incorporating the best features of other eukaryotic model organism databases, such as WormBase (Rogers et al., 2008), FlyBase (Wilson et al., 2008) and VectorBase (Lawson et al., 2009), AphidBase provides detailed information about the aphid and its scientific community, and includes a genome browser for visualizing genome annotation and robust search capabilities. AphidBase is also a central node for communication within the aphid community, with links to a collaborative wiki, to a specialized database on the metabolic networks of aphids and their symbionts (AcypiCyc, http://pbil.univ-lyon1.fr:2555/ACYPI/) and to a comprehensive phylogenomic database for the pea aphid (PhylomeDB, Huerta-Cepas et al., 2008).

Results

Manual curation

A subset of 10 248 genes predicted by Gnomon was strongly supported by biological evidence and has been inserted in RefSeq, the NCBI Reference Sequences database (Pruitt et al., 2009). The high quality of these RefSeq predictions allowed their inclusion in the first A. pisum reference set (Acyr 1.0). When no RefSeq gene was available, Glean (Elsik et al., 2007), a tool that integrates gene predictions from distinct software programs (listed in Table 1), was used to create consensus gene models. The first official reference gene set of A. pisum is composed of 34 603 automatically predicted genes, corresponding to 34 821 transcripts and proteins (International Aphid Genomics Consortium, 2010). Following assembly of this official gene set, the IAGC commenced manual curation to appraise it.
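The way the reference set combines the two sources (all RefSeq genes, plus Glean models that overlap no RefSeq gene) can be illustrated with a short sketch. This is not the IAGC pipeline: the `Gene` record, the function names and the coordinates are hypothetical, and overlap is assumed here to mean sharing at least one base on the same scaffold, regardless of strand.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    gene_id: str
    scaffold: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def overlaps(a: Gene, b: Gene) -> bool:
    """True when both genes lie on the same scaffold and share at least one base."""
    return a.scaffold == b.scaffold and a.start <= b.end and b.start <= a.end


def build_reference_set(refseq: list[Gene], glean: list[Gene]) -> list[Gene]:
    """Keep every RefSeq gene, then add Glean models that overlap no RefSeq gene."""
    extra = [g for g in glean if not any(overlaps(g, r) for r in refseq)]
    return refseq + extra


# Hypothetical identifiers and coordinates, for illustration only.
refseq = [Gene("refseq_1", "scaffold_1", 100, 900)]
glean = [
    Gene("glean_1", "scaffold_1", 850, 1500),   # overlaps refseq_1, dropped
    Gene("glean_2", "scaffold_1", 2000, 2600),  # no overlap, kept
    Gene("glean_3", "scaffold_2", 100, 900),    # different scaffold, kept
]
reference = build_reference_set(refseq, glean)
print([g.gene_id for g in reference])
```

The pairwise comparison is quadratic in the number of genes; for sets of the size reported here, an interval index per scaffold would be the more practical choice.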

Table 1.  The content of AphidBase
Category | Software | Results | Comments
Protein coding gene predictions | Acypi 1.0 (reference annotation) | 34 603 genes | This first reference set incorporates the subset of Gnomon gene predictions that are strongly supported by biological evidence and inserted in RefSeq (Pruitt et al., 2009). This subset has been enriched with Glean predictions that do not overlap with RefSeq genes.
 | Gnomon | 37 994 genes | Gnomon (http://www.ncbi.nlm.nih.gov/projects/genome/guide/gnomon.shtml) is the NCBI gene prediction program.
 | Glean | 36 606 genes | Glean (Elsik et al., 2007) is a software tool that computes consensus gene predictions. We input Augustus, Fgenesh++, Gnomon, GeneID, Genscan, SNAP and Maker (Cantarel et al., 2008) gene models.
 | Augustus | 34 677 genes | Augustus (Stanke et al., 2006) predictions incorporate extrinsic evidence with sequence intrinsic evidence. Extrinsic evidence was taken from GMAP (Wu & Watanabe, 2005) alignments of Acyrthosiphon pisum ESTs and alignments of proteins from 3 other species (Nasonia vitripennis, Tribolium castaneum and Daphnia).
 | GeneID | 55 644 genes | GeneID (Parra et al., 2000) was applied to the A. pisum genomic sequences masked against the Repbase Repeat database, invertebrate division (Jurka et al., 2005).
 | Fgenesh++ | 26 773 genes | The fgenesh++ or fgenesh pipeline (http://www.softberry.com) is a combination of two rounds of the ab initio algorithm fgenesh and two rounds of fgenesh+, which takes into account homologous protein alignments.
 | Maker | 22 738 genes | MAKER aligns ESTs and proteins to a genome, produces ab initio gene predictions, and automatically synthesizes these data into gene annotations (Cantarel et al., 2008).
 | Genscan | 32 322 genes | Genscan is an ab initio gene prediction software (Burge & Karlin, 1997).
Non coding RNA predictions | miRNA finder | 189 miRNAs | miRNAs were identified by coupling a computational approach, using sequence similarity with known miRNA genes and training and structure recognition algorithms, to biological validation by high-throughput sequencing (Legeai F, Rizk G, Walsh T, Edwards O, Gordon K, Lavenier D, Leterme N, Méreau A, Nicolas J, Tagu D, Jaubert-Possamai S. microRNAs of the insect crop pest Acyrthosiphon pisum, in preparation).
 | tRNAscan-SE | 348 tRNAs | tRNAscan-SE (Lowe & Eddy, 1997) is a widely used software for tRNA identification.
Similarities to A. pisum transcripts | EST mapping | 235 621 alignments | The ESTs were extracted from the NCBI nucleotide database using Entrez and aligned using sim4 (Florea et al., 1998).
 | Unigenes mapping | 27 427 alignments | The unigenes were assembled using the tgicl software (http://compbio.dfci.harvard.edu/tgi/software/) and the resulting unigenes were mapped on the genome using sim4 (Florea et al., 1998).
Similarities to transcripts from other aphid species | tblastn or tblastx | 30 840 alignments | Putative coding sequences from ESTs of other aphid species (Toxoptera citricida, Myzus persicae, Aphis gossypii, Rhopalosiphum padi) were predicted by Frame D (Schiex et al., 2003) and compared directly to the genome sequence with tblastx.
Similarities to protein databanks | Blastx vs. Flybase | 27 900 alignments | Blastx against Flybase, Drosophila melanogaster release 5.6
 | Blastx vs. Beebase | 14 480 alignments | Blastx against Beebase Apis mellifera protein database release 1
 | Blastx vs. Uniprot | 194 290 alignments | Blastx against Uniprot trembl and swissprot release 14.2

Manual annotation of the pea aphid genome was completed by a group of 96 people from 10 countries self-organized into 27 annotation groups of 1–30 individuals. Members of this group of expert biologists appraised the automatic annotations and, in doing so, corrected gene boundaries, found new genes and increased the information content of gene models by specifying functional characteristics, or simply by delivering comments or evaluations. To facilitate the process of manual curation, Apollo was set up in AphidBase. The Apollo genome editor is a Java application for browsing and annotating genomic sequences. It offers many functionalities that facilitate the correction of gene structures and allow users to probe, manipulate and alter the interpretation of gene models. Within Apollo, annotations can be created, deleted, merged, split, classified and commented on. For example, one can easily locate and correct incorrect splice sites or start/stop codons, classify a gene as a pseudogene, and even create a new alternatively spliced RNA. Using AphidBase's Apollo configuration, a curator validates or modifies a reference annotation or creates new annotations by a simple drag and drop of any form of gene evidence (predictions) from one panel to the other. Apollo then automatically generates a unique identifier. As a result, the curators do not directly modify the reference set but rather append a new annotation layer. Finally, at each release, curated genes automatically replace the previous reference versions. This process does not require reviewing or double-checking; however, author names are attached to each annotated gene in order to facilitate collaborative work.

Despite development of the Apollo bioinformatic environment, annotating genes remains a laborious process that requires rigor. To this end the IAGC developed recommended practices and standardized procedures (Figs S1–S4) that were published by IAGC on the Aphid Genomics Collaboration Wiki.

Because curators name genes manually, and because names are most useful when they are descriptive, particular attention was paid to clarifying aspects of nomenclature. Two types of nomenclature are associated with a given A. pisum gene: the gene symbol (a symbol or abbreviation) and the full gene name (gene description). When possible, and if the orthology is clear, the Drosophila or human gene names or descriptions have been used by the curator because they are controlled by the FlyBase and HUGO consortiums, respectively.

Nine months after the beginning of the manual curation process, 2010 genes had been manually annotated. Among these manually annotated genes, 1536 genes were tagged as ‘finished’, i.e. their current structures were considered correct according to the available biological data and current knowledge. While most of these genes correspond to a RefSeq prediction, 50 (3.3%) genes were not present in the first reference set (Table 2). Within annotation groups, curators predominantly investigated genes with at least some biological evidence and similarity with known proteins. The fact that generation of the RefSeq set requires biological evidence explains the over-representation of genes having a RefSeq source in the curated set. Only 19% of RefSeq genes, compared with 80% of the Glean predicted genes, were hand-corrected by annotators. This difference in the frequency of hand-correction reflects a lower confidence level in predictions when no biological evidence was available. In summary, about 28% of the predictions needed correction, a rate similar to that associated with the best methods used on the human genome (Guigó et al., 2006).

Table 2.  Curated gene statistics according to their origin in the reference prediction set. Appraised genes are those examined by a biocurator. Corrected genes are those for which biocurators made changes to the automated gene model. The ratio of corrected genes to appraised genes is shown in parentheses. Merged genes include annotated genes that join 2 or 3 Glean or RefSeq predictions. The resulting number of annotated genes is the final set of improved genes
 | RefSeq Models | Glean Models | Total
Number of appraised genes | 1286 | 218 | 1504
Number of corrected genes | 250 (19.4%) | 175 (80.3%) | 425 (28.2%)
Number of merged genes | 10* | 26 | 36
Resulting number of annotated genes | 1281 | 205 | 1486

* 1 RefSeq prediction was merged with another RefSeq prediction.
9 Glean predictions were merged with another RefSeq prediction.
8 Glean predictions were merged with RefSeq predictions.
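As a check on the percentages discussed above, the correction rates can be recomputed from the counts in Table 2. The snippet below is only a worked calculation; the variable and function names are ours. The total comes to roughly 28%, in line with the figure quoted in the text.

```python
# Counts copied from Table 2.
appraised = {"RefSeq": 1286, "Glean": 218}
corrected = {"RefSeq": 250, "Glean": 175}


def correction_rate(n_corrected: int, n_appraised: int) -> float:
    """Percentage of appraised gene models that a curator changed."""
    return 100.0 * n_corrected / n_appraised


rates = {src: correction_rate(corrected[src], appraised[src]) for src in appraised}
overall = correction_rate(sum(corrected.values()), sum(appraised.values()))

for src, rate in rates.items():
    print(f"{src}: {rate:.1f}%")
print(f"Total: {overall:.1f}%")
```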

AphidBase

AphidBase (http://www.aphidbase.com) is an information system set up to safely centralize, manage, mine, disseminate and promulgate data generated by the IAGC. This Information System is based on GMOD (http://www.gmod.org), the Generic Model Organism Database Project, a largely open source project aimed at developing a complete set of software packages for creating and administering the genome database of a model organism. Components of the GMOD project include, among others, a genome browser and a genome editor [GBrowse (Stein et al., 2002) and Apollo (Lewis et al., 2002)], a robust database schema (Chado; Mungall and Emmert, 2007), biological ontology tools and a set of standard operating procedures. Because GMOD is widely used in the bioinformatics community, and is therefore well supported and documented, implementing it gave us the opportunity to set up integrated but flexible solutions that meet the majority of our needs for data storage, controlled vocabulary, visualization and exploration.

A GBrowse genome browser directly connected to the Chado database offers a large number of configurable tracks, which are listed in Table 1. Each detailed feature page in GBrowse contains links to other sources of information. For example, each gene is directly connected to the following: (1) its NCBI Entrez page, which gathers functional information and links to BLink, the NCBI Blast results visualizer tool; (2) its phylogenetic tree established by PhylomeDB (Huerta-Cepas et al., 2008, 2009); and (3) AcypiCyc, a metabolism BioCyc database for A. pisum (Vellozo et al., manuscript in preparation).

AphidBase also provides a configurable Blast search page permitting comparison to A. pisum sequence databanks (reads, scaffolds, official gene and protein sets, predictions and cDNAs). When possible, in order to facilitate web navigation, a reported hit is linked to its genome location in GBrowse or to its detailed page resources (e.g. NCBI Entrez page or FlyBase gene report). In parallel, a full text search engine powered by Lucene (http://lucene.apache.org/) allows a rapid keyword search among the gene annotations or the descriptions of their homologous proteins.

In order to facilitate manual annotation and communication among annotators from multiple labs, organizations and even countries, we made use of a range of collaboration tools including an email listserve, weekly teleconferencing, interactive webforms, an annotation workshop whose goal was equipping the novice annotator with the skills and tools they would need to annotate their genes of interest, and a collaboration wiki.

The IAGC Collaboration Wiki (https://dgc.cgb.indiana.edu/display/aphid/Introduction) served three purposes. First, it provided an information centre where all sorts of information to assist annotators, including annotation guidelines, nomenclature instructions, training resources and important announcements from the IAGC Steering Committee, were disseminated to the community. Second, the wiki played a role complementary to the electronic mailing list in that a discussion that would normally run over many back and forth email exchanges could be summarized on a single wiki page. Third, the IAGC Collaboration Wiki served as an on-line workspace, where any member could contribute to the community in a decentralized way. Within the wiki, each annotation group has its own collaboration site, allowing multiple members of the group to edit the page simultaneously, which helps keep the information current and accurate. In addition, the IAGC Collaboration Wiki is equipped with access control, allowing for restricted access to specific parts of the wiki, thereby facilitating the sharing of prepublication data and free discussions amongst collaborators.

Future evolution of AphidBase

The annotation of a complete genome provides an opportunity to unite the strengths of a diverse community and yet the success of such a project depends critically on a genome information system, such as AphidBase. The current challenge for AphidBase is to implement and/or develop tools to remain functional and accessible as new aphid data accumulates so as to enable the IAGC to make rapid pure and applied scientific advances with these data.

Although the gene curation process is ongoing, we have already noticed that just under 20% of the inspected genes with cDNA coverage or protein similarities have been manually refined, while about 80% of the genes from the Glean reference set (i.e. the ab initio gene models) required manual curation to improve their automatically predicted structures. In conclusion, a single automated annotation of a genome is not sufficient when only a small amount of transcription evidence is available and when the well annotated genome of a closely related species is lacking. Fortunately, new and cheaper technologies are producing more and more sequence reads of either cDNAs (RNA-Seq) or whole genomes. Taking into account newly completed genomes and libraries of millions of cDNA sequences will improve the annotation quality, but will demand computing platforms able to deal with such large amounts of data. In this context, new automatic procedures are now able to incorporate the product of massive scale cDNA sequencing projects to correct gene models or to predict more genes or splice variants (Denoeud et al., 2008; Wang et al., 2008). The strength of AphidBase will lie in its ability to frequently upgrade gene models by using these strategies, combined with the effort made by its scientific community in appraising gene models with regard to new evidence. The reannotation process will affect manually curated genes and will therefore be supervised by experts.

Thereafter, the future of AphidBase will be strongly affected by the quantity of its inherent biological data. Pursuing this ambition, AphidBase is working to improve and automatically update gene annotations by adding functional tips such as a gene belonging to a protein family, its known domains, its classification under a Gene Ontology term or even, when possible, inference of its protein structure. Moreover, AphidBase is expanding annotated features and will soon integrate, for example, transposable elements predicted by the Repet pipeline (Quesneville et al., 2005), putative SNPs derived from the comparisons of ESTs, microsatellites, or new non-coding RNAs.

The wide acceptance of AphidBase will also depend on the range of functionalities and tools it offers. For example, one of the outstanding features of the pea aphid genome discovered during the community annotation process is a very high level of gene duplication (International Aphid Genomics Consortium, 2010). Easy navigation between paralogous genes and tools for graphically comparing their surroundings, such as Synview (Wang et al., 2006) or GBrowse_syn (http://gmod.org/wiki/GBrowse_syn), therefore appear to be key means for increasing knowledge of the evolution of the aphid. In addition, Biomart (Smedley et al., 2009) would be a convincing and efficient solution to help AphidBase users perform advanced and complex queries on biological data sources through a single web interface, regardless of the geographical location of those sources. Finally, we are now implementing functional web pages about gene, transcript, peptide or ontology terms, which will summarize available information and expertise at a glance.

To conclude, AphidBase will also be strongly affected by the quality of its data; in other words, by the level of human curation involved in the procedure, including expertise and literature references. Consequently, implementing a wiki for gathering functional annotation appears to be a good solution because of its ease of use, availability, wide scope and flexibility (Salzberg, 2007; Mons et al., 2008). However, wikis still lack integration with databases such as Chado, or other data warehouse systems.

Materials and methods

AphidBase

AphidBase is stored in a Chado database (v.0.5). Various software was used and several bioinformatics groups were engaged in annotation of the pea aphid genome sequence (International Aphid Genomics Consortium, 2010). BioPerl (http://www.bioperl.org) was used to parse and transform all data files into the standard GFF3 format (http://www.sequenceontology.org/gff3.shtml) required by the Chado database loader. As a result, GBrowse, directly connected to the database, offers a large number of configurable tracks (Table 1).
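To illustrate the target of this conversion, the sketch below writes a small gene model as GFF3, whose records have nine tab-separated columns (seqid, source, type, start, end, score, strand, phase, attributes). It is written in plain Python rather than with the BioPerl modules actually used for AphidBase; the function name, identifiers and coordinates are hypothetical, and the sketch does not escape reserved characters in attribute values as a full GFF3 writer must.

```python
def gff3_line(seqid, source, ftype, start, end, strand, attributes,
              score=".", phase="."):
    """Format one feature as a nine-column, tab-separated GFF3 record."""
    attr = ";".join(f"{key}={value}" for key, value in attributes.items())
    return "\t".join(
        [seqid, source, ftype, str(start), str(end), score, strand, phase, attr]
    )


# Hypothetical gene model; identifiers and coordinates are made up.
records = [
    gff3_line("scaffold_1", "Glean", "gene", 2000, 2600, "+", {"ID": "gene_1"}),
    gff3_line("scaffold_1", "Glean", "mRNA", 2000, 2600, "+",
              {"ID": "mrna_1", "Parent": "gene_1"}),
    gff3_line("scaffold_1", "Glean", "exon", 2000, 2250, "+",
              {"ID": "exon_1", "Parent": "mrna_1"}),
]
gff3_text = "##gff-version 3\n" + "\n".join(records) + "\n"
print(gff3_text, end="")
```

The `Parent` attribute is what ties exons to transcripts and transcripts to genes, so a loader can rebuild the feature hierarchy from a flat file.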

Apollo

Apollo is connected to a duplicate of the public AphidBase Chado database (Fig. 1), enabling users to directly load and save their modifications and edits to this database. Both databases are fed synchronously, so that experts and users get the same information whether browsing through GBrowse or editing through Apollo. The single difference between the duplicates is that the Apollo-dedicated AphidBase copy contains current manual annotation data. All curated genes marked as ‘finished’ in the Apollo ‘Annotation Editor’ dialog box are routinely released into the public GBrowse AphidBase database.

Figure 1.

Data flow of AphidBase. Two databases are fed in parallel with data computed by the administrator or submitted by providers. Regular users access data stored in the databases using different front-ends. Authorized curators insert and modify their annotations through Apollo, saving changes to their gene models directly in the specialized database. ‘Finished’ annotations are frequently exported to the public AphidBase database.
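The release step of this data flow can be sketched as follows. This is an illustration of the logic only, not the AphidBase export code: the dictionary fields (`id`, `status`, `replaces`, `curator`) and all identifiers are hypothetical stand-ins for what is stored in the two Chado databases.

```python
def release(public: dict, curated: list[dict]) -> dict:
    """Return a new public gene set in which 'finished' curated models
    replace the reference models they were derived from."""
    updated = dict(public)
    for annotation in curated:
        if annotation["status"] != "finished":
            continue  # unfinished work stays in the curation copy only
        for old_id in annotation["replaces"]:
            updated.pop(old_id, None)  # a merge may retire several models
        updated[annotation["id"]] = annotation
    return updated


# Hypothetical reference models and curated annotations.
public = {gene_id: {"id": gene_id} for gene_id in ("ref_1", "ref_2", "ref_3")}
curated = [
    {"id": "cur_1", "status": "finished",
     "replaces": ["ref_1", "ref_2"], "curator": "curator_a"},
    {"id": "cur_2", "status": "in progress",
     "replaces": ["ref_3"], "curator": "curator_b"},
]
result = release(public, curated)
print(sorted(result))
```

Returning a new dictionary rather than editing `public` in place mirrors the separation between the curation copy and the public database: nothing becomes visible until a release is made.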

For reasons of safety and traceability, the AphidBase administrators assigned usernames and passwords to authorized curators. Thus, only authorized curators can modify or comment on genome annotations in the Apollo copy of AphidBase.

AphidBase's Apollo database can be started with Java WebStart, allowing the application to be started directly from the AphidBase web site. Furthermore, before launching the application, Java WebStart automatically looks for an update via the internet, and downloads it if necessary.

Blast

AphidBase offers a web BLAST search (version 2.2.15) that allows the parameterized comparison of nucleic and protein sequences against various databanks (transcripts and protein predictions, reads and scaffolds and ESTs).

Lucene full text search

The AphidBase full text search is based on Apache Lucene indexing of flat files, extracted from the Chado database, that contain the descriptions of the RefSeq predictions and of the Uniprot proteins aligned on the genome. It was built with the Lucene Java API and runs within an Apache Tomcat server.
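The principle behind such a keyword search is an inverted index that maps each term to the records containing it. The toy example below shows that idea in plain Python; it does not use or reproduce the Lucene API, and the gene identifiers and descriptions are invented for illustration.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(descriptions: dict[str, str]) -> dict[str, set[str]]:
    """Map every term to the set of gene identifiers whose description contains it."""
    index = defaultdict(set)
    for gene_id, text in descriptions.items():
        for term in tokenize(text):
            index[term].add(gene_id)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the genes whose description contains every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    hits = [index.get(term, set()) for term in terms]
    return set.intersection(*hits)


# Hypothetical descriptions, standing in for the flat files extracted from Chado.
descriptions = {
    "gene_1": "odorant binding protein",
    "gene_2": "heat shock protein 70",
    "gene_3": "odorant receptor",
}
index = build_index(descriptions)
print(sorted(search(index, "odorant protein")))
```

A production engine adds relevance ranking, stemming and on-disk index structures on top of this basic lookup.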

Community organization

The aphidgenomics electronic mailing list is managed using GNU Mailman, an open source mailing list management software written in Python (http://www.list.org/). The aphidgenomics server is hosted in the Department of Ecology and Evolution at Princeton University (http://www.eco.princeton.edu/mailman/listinfo/aphidgenomics). The IAGC Wiki is run on Confluence, an enterprise wiki engine (Atlassian, Sydney, Australia, http://www.atlassian.com/software/confluence/) and hosted at Indiana University.

Acknowledgements

The authors would like to acknowledge the International Aphid Genomics Consortium and in particular all the curators involved in gene model improvement. We thank Ed Lee, Olivier Arnaiz, Scott Cain, Dave Clements and the rest of the GMOD team, and Stéphanie Sibide-Bocs, Joëlle Amselem and Michaël Alaux from ANR GnpAnnot. The assistance of the IT staff of AphidGenomics at Princeton University, and of Phillip Steinbachs in establishing the IAGC Wiki hosted by The Center for Genomics and Bioinformatics, Indiana University, was greatly appreciated. AphidBase is partly funded by ANR Génoplante ‘GnpAnnot’. The Pea Aphid Genome Annotation Workshop I was supported by an American Genetic Association Special Event Award and an NRI, CSREES, USDA award 2007-04628 to ACCW.

Work at NCBI was supported by the Intramural Research Program of the NIH, National Library of Medicine.
