Life sciences are being transformed by a tremendous growth in the scale and complexity of new data and knowledge, reflecting an era of unprecedented technology development that is enabling increasingly high-throughput and low-cost experimentation. This is all part of a “multiomics” approach to research and the veritable information bonanza it brings, but this bonanza is a double-edged sword. Certainly, the resulting genomics, transcriptomics, proteomics, metabolomics, and other large datasets promise greatly improved understanding of biological processes and of their translational application. But such progress will depend upon scientists being able to organize, integrate, share, and interpret this wealth of new information in sophisticated and effective ways. Meeting this challenge is far from trivial, and to the extent we fail in this endeavor we risk missing potential discoveries and, even more critically, missing the truth and drawing false conclusions. Obvious examples of such problems from the field of genetic association analysis include: not being able to account for publication bias when aggregating datasets; employing too little phenotype data to distinguish between similar phenotypes with differing etiologies; and performing meta-analysis without being able to incorporate information about differences in population environments or haplotype structures. New systems biology studies and other projects that consider data across multiple omics disciplines are even more vulnerable to such confounding influences.
The myriad problems involved in properly managing and exploiting today's and tomorrow's life science data relate to things such as the fragmentation of data across hundreds of heterogeneous databases, the lack of standardization, the inconsistent identification of biological objects and concepts [Goble and Stevens, 2008], poor enabling of resource discovery [Cannata et al., 2005], difficulties in facilitating data quality assurance and curation [Howe et al., 2008], and approaches to promoting extensive and yet ethically and culturally acceptable data sharing [Walport and Brest, 2011; Wellcome Trust, 2011]. There is also a need for more effective representation of scientific knowledge distilled from research data, and for linking data and other research objects into future modalities for semantic publishing [Bourne, 2005, 2010; Neylon, 2009; Shotton et al., 2009]. Furthermore, as the real and virtual worlds of science increasingly merge so that research is “done” not just “reported” online, there is a need to come up with completely new paradigms for socioscientific interaction in the digital age [Stafford, 2010], to promote highly collaborative and interactive modes of Internet-based scholarly debate and communication.
In this communication, to explore and illustrate the challenges and current progress in some of the areas listed above, we focus on the science of genotype-to-phenotype (G2P) relationships in humans and model organisms. Even within this one domain there are many multidimensional challenges to be tackled. On a very basic level, the massive data volumes generated by next-generation sequencing instruments present major informatics challenges for smaller laboratories that utilize these devices (either locally or via external service providers), and this considerably curtails the scientific impact that these new technologies are having [Editors, 2008]. More generally, scientists are facing the herculean task of reporting, cataloging, and managing the seemingly limitless number of G2P interactions being identified by research and diagnostic laboratories on a daily basis. For example, according to the NHGRI GWAS Catalog [Hindorff et al., 2009] (http://www.genome.gov/gwastudies/), genome-wide association studies (GWAS) have been published at a rate of approximately five research articles per week during the past 2 years. Because partial or negative studies are generally not reported, however, even this number is a substantial underestimate of the true frequency at which genetic association findings are being produced. Similarly, diagnostic labs routinely perform many DNA mutation scans on patients with traits that have a heritable component, but very little of this information is ever released and utilized by others. Furthermore, there are many other pressing challenges relating to G2P data, not least ethical, legal, and social issues relevant to promoting and achieving the sharing of potentially identifiable data from human subjects [Kaye et al., 2009, 2010; Povey et al., 2010].
Tackling G2P Data Challenges
Traditional approaches to biological databasing have been mostly based on the “centralized” model, characterized by gathering data into a large central hub for storage, integration, and display. Historically, this strategy has proved highly successful. Examples include the global collaboration of nucleotide sequence archives (http://www.insdc.org) established in the 1980s, and sophisticated resources for data analysis and visualization provided by bioinformatics centers such as NCBI (http://www.ncbi.nlm.nih.gov), UCSC (http://genome.ucsc.edu), and EBI/EMBL (http://www.ebi.ac.uk). However, as we argued elsewhere [Thorisson et al., 2008b], centralization alone is insufficient for dealing with the full quantitative breadth and qualitative depth of contemporary G2P data, and so hybrid models combining centralized databases with “federated” networks of distributed data and analytical resources are required to tackle the new challenges facing the G2P data field.
Federation of data storage, provision, and analysis across sites is well established in some other scientific disciplines that have a longer history of dealing with “big data,” such as astronomy and particle physics, and it is also a cornerstone of data-intensive scientific research, or e-Science [Buetow, 2005; Hey and Trefethen, 2005]. Increasingly, such projects employ Web service-based grid computing to enable automated resource discovery and data analysis. A prominent example is the multi-institutional caBIG project (https://cabig.nci.nih.gov), which has constructed a centrally managed and tightly integrated network designed to seamlessly link dozens of cancer research institutions in the United States and internationally [Buetow, 2009; Saltz et al., 2006]. Another example based on many of the same technologies but with a contrasting, decentralized style is the UK-based myGrid family of tools [Bhagat et al., 2010; Goble et al., 2010; Hull et al., 2006; Oinn et al., 2004] (http://www.mygrid.org.uk).
Unfortunately, the majority of projects and institutions that produce and analyze G2P data do not participate in these federated and open grid initiatives. Hence, there is a real problem in ensuring that all their valuable data and discoveries become shared and merged into the online universe of G2P information. To help enable this, and to promote and support blended federated-centralized approaches to G2P data exploitation in general, a 5-year Genotype to Phenotype databasing (GEN2PHEN) project was launched at the start of 2008, via a €12 M award under the European Community's Seventh Framework Programme (FP7: http://cordis.europa.eu/fp7/). GEN2PHEN specifically aims to help establish holistic access to G2P information, through modular tool and data standards developments toward a federated network of online G2P resources, and simultaneously to facilitate the bidirectional flow of knowledge between public G2P databases and G2P researchers. Below, we give a broad overview of the GEN2PHEN project and describe one of its main deliverables, the “G2P Knowledge Centre”: an integrated G2P community Website, information resource, tool repository, and comprehensive data access portal.
The GEN2PHEN Project
The GEN2PHEN consortium is made up of representatives from 25 research organizations and companies based in 10 countries in Europe, in Saudi Arabia, and in India (see full list of partners at http://www.gen2phen.org/about/partners), providing exceptional competence and broad expertise in various aspects of G2P data management and exploitation. Their common goal is to improve the effectiveness of G2P databasing, described as “disastrously deficient” in a review written shortly before the project's conception [Patrinos and Brookes, 2005]. In practice, this means enabling heterogeneous and largely unconnected G2P data resources to evolve toward a comprehensive “G2P biomedical knowledge environment.”
Most GEN2PHEN activities are assembled around three core, practical aspects of G2P databasing: (1) devising data standards and pipelines for submitting and collecting data, (2) designing and deploying federated and interoperable modular components for storing and curating diverse datasets, and (3) developing solutions for exchanging, integrating, and extracting information from the resulting network of federated and centralized databases. Consortium partners all have solid track records in some or all of these areas and are well connected with the broader G2P community. This latter point is essential for aligning and codeveloping GEN2PHEN solutions with those of other allied projects, often via close collaboration. Indeed, consultation, outreach, and dissemination involving the wider G2P community and beyond were prioritized from the very outset of the project.
Details of specific GEN2PHEN objectives, planned and completed deliverables, ongoing activities, and other related information are all published online (http://www.gen2phen.org/about) and so they will not be elaborated here in detail. Instead, this section briefly summarizes the main areas where GEN2PHEN is currently focusing its effort, listing several projects as examples.
To facilitate resource interoperability and enable seamless G2P data exchange and integration, it is vital to increase the overall level of standardization in the field. To this end, GEN2PHEN has worked extensively with others toward developing, refining, and promoting key G2P domain data standards. This includes conceptual models, ontologies and nomenclature conventions, with an overall focus that entails coordinated “bottom-up” standards creation by the community [Brazma et al., 2006; Quackenbush, 2006], rather than “top-down” impositional approaches and formal standardization procedures. Therefore, GEN2PHEN has much in common with, and has connections to, related initiatives such as the Reporting Structure for Biological Investigations Working Groups (RSBI WGs; http://www.mged.org/Workgroups/rsbi/index.html) and Minimum Information for Biological and Biomedical Investigations (MIBBI; http://www.mibbi.org), which promote collaborative development of “omics” reporting standards [Sansone et al., 2008; Taylor et al., 2008].
Examples that embody GEN2PHEN's collaborative approach and success in the area of standards development include close partnering with the groups behind the PaGE-OM for G2P data [Brookes et al., 2009] (http://www.pageom.org), imminent publication of new core data models for phenotype data and locus-specific database (LSDB) content, and a joint effort with the NCBI that has produced the Locus Reference Genomic (LRG: http://www.lrg-sequence.org) framework for standardized reporting of gene variants [Dalgleish et al., 2010] (see Box 1).
Toward a Unified G2P Data Infrastructure
The main thrust of GEN2PHEN's infrastructural work is the creation of a range of reusable databases and software tools, with an emphasis upon federation and Web services. Naturally, these components are all standards compliant, and they provide the G2P community with a suite of technological building blocks for creating new (or augmenting existing) data systems that can be incorporated into the globally emerging online network of G2P resources. Thereby, in combination with other databases, a fully interconnected, interoperable, and transparently searchable universe of G2P resources can be assembled, for manual and automated data discovery and analysis, as represented in Figure 1.
Similar to the approach taken for standards development, GEN2PHEN favors collaborative, open-source software development and reuse/adaptation of existing software where possible. The power of this open, community-oriented approach is shown by bioinformatics software initiatives such as BioPerl [Stajich et al., 2002] (http://www.bioperl.org) and BioJava [Holland et al., 2008] (http://biojava.org). Examples of GEN2PHEN projects in this arena include software packages for easy creation of LSDBs and close partnership with the team developing Molgenis (http://www.molgenis.org), an open-source platform for rapid prototyping of genomics database software [Swertz et al., 2010] (see also Box 2).
Data Flow, Data Access, and Data Integration
GEN2PHEN is also creating a variety of solutions for search, retrieval, and integration across the G2P information space. This work builds on, and will demonstrate the utility of, databases and software tools created in the project. Initial work has focused on integration and advanced data provision via existing centralized resources, notably the Ensembl genome browser (http://www.ensembl.org). This work employs established technologies such as the BioMart data integration system for large-scale data querying [Smedley et al., 2009] (http://www.biomart.org), and the DAS protocol for exchanging record annotations [Dowell et al., 2001; Jenkinson et al., 2008] (http://www.biodas.org), and is complemented by new Web services built on top of various project databases and by the construction of data discovery platforms.
Highlights of work undertaken to date include data exchange and integration between LSDBs and Ensembl, facilitated by the aforementioned LRG standard. Also, exemplifying the power of the hybrid federation/centralization approach, GEN2PHEN has built HGVbaseG2P [Thorisson et al., 2008a] (http://www.hgvbaseg2p.org), recently rebadged as “GWAS Central” (http://www.gwascentral.org), to provide powerful graphical and textual modes for comparing and contrasting multiple datasets from published, unpublished, and private user-uploaded GWAS. Finally, as an illustration of how data can be openly exposed yet still shared in a controlled manner, the GEN2PHEN project offers Mendelian gene mutation data via the Café for Routine Genetic data Exchange (“Café RouGE”: http://www.caferouge.org), an innovative “clearing house” concept that could be easily redeployed to support the safe advertising of many types of data. See Box 2 for a more detailed listing.
Beyond the above practical projects concerned with creating or extending mostly traditional data-centric online resources, GEN2PHEN is working on cultural and policy issues, such as: ethicolegal considerations around G2P data collection and sharing; the idea of providing a BioResource Impact Factor (BRIF) metric for biobanks and databases [Cambon-Thomsen, 2003; Kauffmann and Cambon-Thomsen, 2008]; and designing, creating, and piloting the use of digital IDs for researchers via involvement in the newly formed ORCID initiative (http://www.orcid.org), so that a researcher's online G2P activities and contributions can be discovered, recognized, rewarded, and encouraged. All these different aspects of the GEN2PHEN work program progress in parallel, with links and crossfertilization opportunities being exploited wherever possible. But there is one overriding activity that seeks to bring virtually all the other subprojects together: the G2P Knowledge Centre (KC), a virtual “Center of Excellence” designed to provide a range of new, innovative services to support G2P research.
The G2P Knowledge Centre
The overriding goal of the KC is to provide a central platform amalgamating direct access to distributed G2P data with specialist knowledge, all encompassed within a collaborative scientific online workspace (Fig. 2).
Other scientific disciplines have already embraced this kind of online collaborative data enrichment resource. For example, in the field of nanotechnology, the nanoHUB facility (http://www.nanohub.org) provides direct access to powerful simulation tools coupled with extensive community-driven features such as downloadable lectures and presentations, online seminars, events listings, and mechanisms for rapid publication of data and other results. The nanoHUB project has been hugely successful, with its scope extending to some 1,600 resources involving 600 contributors, and over 100,000 users per year. Such initiatives are far less common in the biomedical sciences, although some notable smaller scale examples do exist, such as Alzforum [Kinoshita and Clark, 2007] (http://www.alzforum.org), which combines access to data from Alzheimer's disease research, discussion forums, event listings, and virtual conferences. In going beyond the remit of simple data portals, such sites can help the scientific endeavor by bringing experts together around common problems and concrete data. So far, however, no such tool has existed for the genotype–phenotype field in general—an oversight the G2P Knowledge Centre seeks to address.
Although not its central mission, one section of the KC serves as the GEN2PHEN project Website. This gives GEN2PHEN a way to leverage the KC environment to disseminate information on the project, to provide full and immediate access to all the project deliverables/outputs and training activities, and to furnish a comprehensive listing of every GEN2PHEN-related tool, Website, and database (see Box 3).
This information thereby contributes to the KC's far broader set of data listings and search capabilities, in turn coupled into an innovative system whereby users may provide per-record annotations for remotely hosted G2P data. Finally, superimposed upon all of this, the KC provides an array of tools for establishing and nurturing active online research communities.
The main features of the KC are discussed in more detail below.
A Central Search for G2P Data
Databases holding G2P data are both many and diverse, ranging from per-gene locus-specific databases, to GWAS catalogs and G2P archives. A researcher hoping to track down G2P data for a given locus would need to visit several databases and negotiate differing user interfaces and data formats merely to see what data are available, let alone retrieve, integrate, and analyze those data. The KC seeks to reduce this workload by providing a single access point to the wealth of data stored throughout an extensive federated G2P database network. The principle is simple—a KC search will return a summary of available G2P data organized by source database, with result entries linking back to original records. Relevant records within the aforementioned Café RouGE will also be made available via this central search tool.
This holistic searching is carried out “live”—that is, searches do not serve up old results from an internal database updated periodically by scanning client databases; instead, the system interrogates the many source databases directly, in real time. Therefore, data are always up to date (within a few hours), subject to caching mechanisms put in place both to prevent overloading client databases and to still provide results in the event of a source database being inaccessible.
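The live-search behavior described above can be illustrated with a minimal Python sketch. It is not the KC implementation: the class name `FederatedSearch`, the status labels, and the 3-hour default cache lifetime are assumptions made for illustration, and each source database is represented by a plain callable rather than a real Web service client.

```python
import time
from typing import Callable, Dict, List, Tuple

# A "source" is any callable that takes a query string and returns records.
# Real sources would be Web service clients; here they are hypothetical.
Source = Callable[[str], List[dict]]


class FederatedSearch:
    """Query several source databases live, with a time-limited cache."""

    def __init__(self, sources: Dict[str, Source],
                 ttl_seconds: float = 3 * 3600,
                 clock: Callable[[], float] = time.time) -> None:
        self.sources = sources
        self.ttl = ttl_seconds  # assumed cache lifetime ("a few hours")
        self.clock = clock
        # (source name, query) -> (time fetched, records)
        self._cache: Dict[Tuple[str, str], Tuple[float, List[dict]]] = {}

    def _query_one(self, name: str, query: str) -> Tuple[List[dict], str]:
        key = (name, query)
        cached = self._cache.get(key)
        now = self.clock()
        # A fresh cache entry spares the source database a repeated request.
        if cached is not None and now - cached[0] < self.ttl:
            return cached[1], "cached"
        try:
            records = self.sources[name](query)
        except Exception:
            # Source unreachable: fall back to older results if any exist.
            if cached is not None:
                return cached[1], "stale"
            return [], "unavailable"
        self._cache[key] = (now, records)
        return records, "live"

    def search(self, query: str) -> Dict[str, dict]:
        """Return a summary of available data, organized by source database."""
        summary: Dict[str, dict] = {}
        for name in self.sources:
            records, status = self._query_one(name, query)
            summary[name] = {
                "status": status,
                "count": len(records),
                "records": records,
            }
        return summary
```

In this sketch, a call such as `FederatedSearch({"some_lsdb": fetch_fn}).search("BRCA2")` returns one entry per source with a record count and a status, so that a failing source degrades to older cached results rather than breaking the whole search.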
Crucially, in addition to the broad search capability, the KC aims to go yet further by providing a novel annotation system, whereby users can directly comment on and flag search results from remote databases. The idea here is that these user-supplied annotations will be made available both to database maintainers and to the wider G2P community, both on the KC Website and in machine-readable form via an application programming interface (API), thus providing a community-edited annotation layer for distributed G2P data. This system, inspired by recent wiki-like community annotation projects like WikiProteins [Mons et al., 2008] (http://www.wikiproteins.org) and the RNA WikiProject [Daub et al., 2008] (http://en.wikipedia.org/wiki/Wikipedia:WikiProject_RNA), allows annotations to be anchored to specific database records or resources, perhaps sparking further community-based debate and discussion. In this way, G2P database content is taken beyond the realm of static database records to become a “living” entity, enhanced and evolved by user and producer comments.
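As a rough illustration of how annotations might be anchored to remote records and exposed in machine-readable form, consider the following Python sketch. The class names, field names, and the two annotation kinds are assumptions introduced here; the actual KC data model and API are not specified in this article.

```python
import json
from collections import defaultdict
from dataclasses import asdict, dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Annotation:
    source: str      # name of the remote database (hypothetical label)
    record_id: str   # identifier of the record within that database
    author: str
    kind: str        # "comment" or "flag" (assumed vocabulary)
    text: str


class AnnotationLayer:
    """Community annotations keyed by (source database, record identifier)."""

    def __init__(self) -> None:
        self._by_record: Dict[Tuple[str, str], List[Annotation]] = defaultdict(list)

    def add(self, annotation: Annotation) -> None:
        if annotation.kind not in ("comment", "flag"):
            raise ValueError(f"unknown annotation kind: {annotation.kind!r}")
        key = (annotation.source, annotation.record_id)
        self._by_record[key].append(annotation)

    def for_record(self, source: str, record_id: str) -> List[Annotation]:
        # .get avoids creating empty entries for records never annotated.
        return list(self._by_record.get((source, record_id), []))

    def to_json(self, source: str, record_id: str) -> str:
        """Machine-readable view, as an API endpoint might return it."""
        return json.dumps(
            [asdict(a) for a in self.for_record(source, record_id)],
            sort_keys=True,
        )
```

Because the remote record is referenced only by its source name and identifier, the annotation layer in this sketch never needs to copy or modify the record itself, which matches the idea of a layer sitting on top of distributed data.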
A Catalog of Locus-Specific Databases
The LSDB catalog provides a perfect example of the KC's philosophy of providing content not only through human-readable Web pages, but also via alternative machine-readable formats, allowing greater integration with other tools and resources. This provision will be expanded upon and enhanced with future incremental KC updates.
News, Blogs, and Event Listings
The KC provides a broad-scope information portal for the G2P community, encompassing not just data and analytical resources, but also useful day-to-day features such as news items, events listings, and blog posts. For the news section, abstracts and short summaries are gathered from relevant journals and other online resources and manually selected for particular relevance. Each abstract or article is linked to the original full article on the source Website. Visitors to the KC can post comments on articles, and read comments posted by others. Additionally, visitors can utilize the site-wide bookmarking system to track updates, alterations, and comments on these items, providing a personalized listing of interesting content via simple Web interfaces and an RSS feed. In particular, the specialist editorial selection of articles is likely to be of high interest to scientists working on genotype-to-phenotype relationships and allied fields. As with most KC content, the regular news digest is available not only through the Web interface, but also through RSS feeds. Finally, users are strongly encouraged to contribute news articles that they find interesting, either by submitting full stories or simply by suggesting useful links. These articles or useful links will then be published at an editor's discretion.
Complementing the KC's aggregated news provision, a number of contributor blogs are provided for the dispensation of timely opinion pieces and short rapidly disseminated articles, often of a less formal nature than the aforementioned news articles. These are intended to quickly highlight both scientific and technological developments and provoke healthy debate by the G2P community. Again, blog posts can be obtained via blog-specific and site-wide RSS feeds, and these can be monitored via the site's intuitive bookmarking system. Enquiries from users interested in running their own blog on the KC are welcomed.
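For readers who wish to consume such RSS feeds programmatically, the following minimal Python sketch shows one way to extract item titles and links using only the standard library. The embedded feed is an invented example in RSS 2.0 layout; it does not reproduce the actual structure or URLs of the KC feeds.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Invented sample feed; real feed content and addresses will differ.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example G2P news digest</title>
    <item>
      <title>First example item</title>
      <link>http://example.org/news/1</link>
    </item>
    <item>
      <title>Second example item</title>
      <link>http://example.org/news/2</link>
    </item>
  </channel>
</rss>
"""


def parse_rss_items(xml_text: str) -> List[Dict[str, str]]:
    """Return the title and link of every item in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.findall("./channel/item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        })
    return items
```

Calling `parse_rss_items(SAMPLE_FEED)` yields one dictionary per feed item; in practice the XML text would first be downloaded from the feed address of interest.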
The KC also features listings of upcoming G2P-related events that may be of interest to its users, including conferences, symposia, and training events. The community can also use the facility to advertise their own events to others in the G2P field via a simple submission form. Events are displayed in a simple listing, in an interactive calendar, and on a map.
In general, users can comment upon, and sometimes update, almost any item within the KC. This reflects one of the site's principal goals, that content is not simply posted to be viewed or remain stagnant, but instead it should be allowed to, and encouraged to, evolve so that it drives community debate and hence advances science in the G2P domain.
The “Interest Groups” section of the KC provides self-contained areas of the site (“mini-KCs”) dedicated to particular fields or projects in a manner similar to commonplace Internet forums. Unlike regular forums, however, where users are typically restricted to simple thread-based text messages, users may contribute documents (which will then be viewable within the Web browser, or downloadable), regular posts, wiki pages (which may be edited by other group members, thus easing the production of collaborative documents), news articles, and events as part of the content in these groups (Fig. 3). These features provide a flexible and powerful workspace for collaborative groups, and hence are used both within the GEN2PHEN project and by the wider G2P community. Each group is maintained by a dedicated group administrator, whose job it is to manage group posts and memberships. Posts within groups may be restricted to the group, or made visible to all users of the site.
Interest groups further allow collaborative workspaces to be tightly coupled to the resources available and accessible from the KC. Active interest groups as of January 2011 (see http://www.gen2phen.org/community) cover the following topics:
Bio-Resource Impact Factor (BRIF).
Locus Reference Genomic standard (LRG).
Phenotype data modeling.
Utilizing the semantic Web.
Web services and exchange formats.
Most of these groups are available to the general public, whereas a few require users to receive authorization from a group administrator to participate. The KC welcomes and encourages applications from members of the G2P community who wish to utilize the interest group facilities for their own topics and projects. Such proposals can be made via a simple request form available at the site.
System Design and Implementation
The KC has been constructed using the Drupal content management system (CMS) (http://www.drupal.org). Despite a steeper development learning curve compared to other popular CMSs, Drupal was selected for its robust and extremely flexible code base with which to build sophisticated Web applications. The standard platform has since been considerably extended using a combination of publicly available contributed modules and several custom-coded modules specific to the KC implementation. As with other GEN2PHEN software packages, these extensions will be made available for download as open-source software.
As summarized in this article, even though only two-thirds of the way through its funding period, the GEN2PHEN project has already generated many key resources and paved the way to a fully integrated, community-enhanced G2P network. In its current form, the KC provides the G2P community with a central hub for data, other useful information, and community interaction. To further leverage these powerful tools and resources, the project generally (and the KC in particular) will continue to explore new methods to distribute its contents besides the human-readable HTML-based Website. Besides commonly used formats for Web content syndication such as RSS and Atom, technologies and formats such as the resource description framework (RDF) may be employed to provide data and leverage the potentially immense power of the Semantic Web [Berners-Lee and Hendler, 2001; Berners-Lee et al., 2001]—a self-describing Web-based global network of linked data (http://linkeddata.org). An associated expansion of Web services both on the KC and as part of other project resources will simultaneously help to facilitate Web-based data integration or “mash-ups” [Cheung et al., 2008], and machine-oriented knowledge generation.
Most ambitiously of all, the GEN2PHEN project has recently begun exploring how G2P data and related tools might be adapted or newly created to move G2P knowledge beyond the research domain and into the healthcare environment. Clearly, this raises many challenges that are far beyond the scope of GEN2PHEN itself. But the very successful philosophy the project has followed, with its emphasis upon integrated community development work toward fully interoperable information networks and the bidirectional flow of information to/from the user community, probably represents a good template for future projects seeking to integrate G2P and other bioscience realms, especially if aiming at delivering improved healthcare.
In summary, even though we argue that GEN2PHEN has made a good start in a range of directions, we are fully aware that a massive amount of further work needs to be done if scientists are to fully meet the challenge set out at the start of this article: that is, to organize, integrate, share, and interpret the wealth of new life science information in sophisticated and effective ways. We believe this challenge can and will be met, and foresee many exciting and revolutionary years ahead as this is achieved.