Next-generation DNA sequencing instruments have made large-scale sequencing widely available to users. While sequencing was dominated by large-scale sequencing centers as recently as a few years ago, the advent of novel sequencing machines [1, 2] has begun a process referred to as the democratization of sequencing. As this phenomenon continues, the cost and availability of sequencing machines will improve broadly. Meanwhile, the data production rates of these sequencing instruments are growing rapidly, leading to a crisis in data analysis.
Next-generation sequencing instruments are used in a variety of areas; however, the area that has seen the most dramatic change in computational requirements is the analysis of shotgun metagenome data, or metagenomics [1-3]. Metagenomics, also known as community genomics, is the study of genomic data from environments of unknown biological composition. Metagenomic analysis techniques are routinely used to study microbial communities in a variety of environments. For example, the TARA expeditions (http://oceans.taraexpeditions.org), the Terragenome Project, the Global Ocean Sampling, and the Human Microbiome Project all make heavy use of metagenomic analysis techniques.
The amount of public DNA sequence data, generally from conventional genomic projects, has grown dramatically over the past two decades, doubling about every 14 months (Figure 1). Early experience with metagenomics projects suggests that metagenomics sequence data will have a substantially shorter doubling time. While traditional genome sequencing requires establishing an in vitro culture of the organism to produce sufficient DNA material for sequencing, metagenomics relies on direct sequencing of environmental samples without the intermediate in vitro step. Removing the culturing step has led to a flurry of metagenome projects and has moved sequence-based biology into a new phase, as the major challenge is shifting from generating sequences to analyzing them.
In this paper, we present the metagenomics-rapid annotation using subsystem technology (MG-RAST) service (Subsection 'Metagenomics-rapid annotation using subsystem technology'), including its architecture and computational bottlenecks. We also discuss the process through which we adapted MG-RAST to make use of cloud computational resources, including the technical challenges we faced as well as unexpected issues.
1.1 Metagenomics-rapid annotation using subsystem technology
The MG-RAST server, which allows public upload and analysis of data via its web portal, has over 2000 users from 30 countries and has been operating since 2007. As the sizes of metagenomic data sets have grown, so have the computational costs of the analysis. Our goal over the last several years has been to keep pace with the increases in sequencing data set sizes, so that users can continue to analyze similar numbers of data sets, albeit with the higher fidelity provided by newer, larger data sets. In order to do this, we have had to scale up the computational capacity of MG-RAST.
In order to provide a frame of reference for both the absolute demand for computational capacity and some relative growth rates, we have performed a substantial amount of benchmarking for MG-RAST. During the course of this work, we determined that the analysis of 1 gigabase of data using the MG-RAST v2 protocols consumed approximately 2000 core hours on Intel Nehalem machines. A majority of this time is spent performing sequence similarity analysis. The data growth rates are far more troubling; we have observed a growth from 17 gigabase data sets to well over 300 gigabases in the last 18 months. This growth rate is unlikely to slow in the next few years. Taken together, these factors require both a quickly growing computational infrastructure and approaches to harness it effectively.
In mid-2009, a moderately sized cluster of 32 commodity machines supported MG-RAST analysis. With this scale of resources, our pending workload far outstripped our capability to analyze the data. Our analysis techniques had already been optimized; the majority of the cost came from the similarity computation (Figure 2), performed by the National Center for Biotechnology Information's (NCBI's) Basic Local Alignment Search Tool (BLASTX) code. Unlike large-scale comparison to DNA databases, where a variety of alternative codes exist (e.g., [4-6]), there were no algorithmic alternatives to the computationally expensive BLAST code. Furthermore, even with significant improvements in algorithms, data set size and user base growth both require considerable expansion of the underlying computational platform.
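As a rough illustration of what these benchmark figures imply, the following Python sketch scales the reported cost with data set size. The linear scaling, the assumption of perfect parallel efficiency, and the function names are ours for illustration only; the benchmark itself reports the cost for 1 gigabase.

```python
# Benchmarked cost reported in the text: about 2000 core hours per gigabase
# (MG-RAST v2 protocols on Intel Nehalem machines).
CORE_HOURS_PER_GIGABASE = 2000


def estimated_core_hours(gigabases, core_hours_per_gigabase=CORE_HOURS_PER_GIGABASE):
    """Estimate core hours, ASSUMING cost grows linearly with data set size."""
    return gigabases * core_hours_per_gigabase


def wall_clock_days(core_hours, cores):
    """Idealized wall-clock days on `cores` cores (perfect efficiency assumed)."""
    return core_hours / cores / 24


if __name__ == "__main__":
    for size in (17, 300):  # data set sizes (gigabases) mentioned in the text
        print(f"{size} gigabases: ~{estimated_core_hours(size)} core hours")
```

Under these assumptions, a 300 gigabase data set would require on the order of 600,000 core hours.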
Although expanding the local computational cluster seemed like a promising idea, the emerging cloud computing paradigm caused us to investigate the utility of cloud computing for our application. Our assumption for the computational profile of cloud resources was the traditional view of Infrastructure as a Service cloud offerings: nodes with a variety of CPU and memory configurations with commodity networking and the associated poor inter-node connectivity.
Here, we discuss the process of determining the suitability of computations in the MG-RAST computational pipeline to the cloud computational paradigm. We will begin by describing the design of the system, the MG-RAST computational pipeline, workload, and initial computational resources.
1.2 The metagenomics-rapid annotation using subsystem technology system design
The system provides a web portal and a computational pipeline for the analysis of metagenomics data sets. Users submit sequence data sets via the MG-RAST portal, a web interface implemented as a series of Perl CGI scripts. These data are loaded into a Postgres database, along with metadata about the data set and information about which analyses should be performed. This information is used, in turn, to submit jobs into the computational pipeline; in v2, the work was submitted through a traditional resource manager (Sun/Oracle Grid Engine), whereas in v3, a locally developed workflow engine (Argonne Workflow Engine; AWE) is used to manage work execution. AWE is described below.
1.3 The metagenomics-rapid annotation using subsystem technology computational pipeline (v2)
Once uploaded, the data exist in the form of standard sequence file formats accompanied by metadata (see the work of the GSC). The pipeline consists of a series of Perl scripts that execute various data transformations; each stage is run as a job in the resource manager, as seen in Figure 3.
The initial stage of the pipeline performs basic quality control, detects duplicate sequences, and calculates some statistics. The sequence data are subsequently split into smaller files and distributed to the computational cluster. Each sequence subset is run through BLASTX similarity searches, identifying matches of the nucleic acid sequence data to amino acid sequences in a large non-redundant sequence database. Each sequence data subset contains only 100,000 characters of sequence data (or 100 kilobases), whereas the non-redundant protein sequence library contains 1.7 gigabytes of amino acid data. This is a so-called “pleasantly parallel” problem, as the data can be subdivided into pieces with no interdependence.
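The splitting step described above can be sketched in Python as follows. This is a minimal illustration rather than the pipeline's Perl code: it assumes FASTA input, keeps each sequence whole, and the names `read_fasta` and `split_into_subsets` are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA-formatted lines."""
    header, parts = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        else:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def split_into_subsets(records, max_bases=100_000):
    """Group whole records into subsets of at most `max_bases` characters.

    A single record longer than `max_bases` is placed in a subset of its own.
    """
    subset, size = [], 0
    for header, seq in records:
        # Close the current subset before it would exceed the size limit.
        if subset and size + len(seq) > max_bases:
            yield subset
            subset, size = [], 0
        subset.append((header, seq))
        size += len(seq)
    if subset:
        yield subset
```

Because no subset depends on another, each one can be handed to a separate similarity search job.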
Upon completion of the similarity searches, several derived data products are computed (e.g., reconstruction of microbial metabolism and a listing of species present in the data set) and subsequently loaded in the portal's database.
1.4 Local computational resources
The computational system that powers MG-RAST is a fairly standard Linux cluster managed by Sun Grid Engine (SGE). It comprised 32 compute nodes in a variety of CPU and RAM configurations. Data are mounted on every node via network file system, and there is a standard user environment. An in-house software package was written to automate the running of the pipeline. It handles management of tasks in SGE, bookkeeping, and interactions with the MG-RAST web interface.
MG-RAST's original architecture depended on a tightly coupled system environment. Although this approach simplified development processes, it limited our computational operation to local resources that we controlled. This limitation increased the analysis backlog as MG-RAST grew in popularity. In particular, we were unable to take advantage of other computational resources provided to the scientific community at large.
Procurement of dedicated hardware also poses difficulties. The lead times involved in procurement and deployment of hardware limit our ability to react to peaks in demand in a timely fashion.
2 MAKING USE OF DISTRIBUTED HETEROGENEOUS RESOURCES
Because of the combination of increasing demand for analysis via MG-RAST and quickly increasing data set sizes, it has become evident that dedicated computing resources will not provide sufficient capacity for even the short term, with considerably larger shortfalls in the medium and long terms. When tackling this problem, our first approach was to scale out the MG-RAST backend to existing shared resources at Argonne, followed closely by the use of cloud resources. This change posed some technical challenges because of differences in the general infrastructure provided on shared systems, as well as the switch to using distributed resources.
2.1 Technical challenges
Several high-level issues quickly presented themselves as we adapted the MG-RAST pipeline to the use of distributed resources. First and foremost, we realized that we had an orchestration problem: MG-RAST had previously depended heavily on SGE for job execution reliability and prioritization, and SGE was not in use on these shared resources. In addition, analysis operations had a variety of system infrastructure requirements, some requiring large amounts of memory or local disk space and others producing large quantities of output.
2.2 Choosing which analysis operations to distribute first
An examination of the computational needs of analysis stages in the MG-RAST pipeline showed that they fell into four broad categories: data-intensive applications, large-memory applications, applications with long runtimes, and pleasantly parallel applications. Table 1 describes each of the major stages in the pipeline in terms of the overall fraction of runtime, CPU intensity, memory requirements, and data requirements. By far, the largest share of runtime, and hence the analysis bottleneck, came from similarity computations; this stage also had the simplest set of infrastructure requirements, so we distributed it first.
Table 1. Resources required for various stages of metagenome analysis in MG-RAST.
[Table body not recovered. The only surviving cells give "Full data set" as the data requirement for three stages.]
2.3 Wide-area computational orchestration
As we mentioned previously, we depended on SGE for analysis execution reliability and control. Upon initial inspection, this function seemed simple enough to implement ourselves, so we built a single centralized coordination entity called AWE to coordinate execution between multiple distributed systems. As we described in the previous section, the similarity analysis stage was our initial target. This stage is implemented using NCBI BLAST. It is a good candidate for distribution, both because of its large overall resource consumption and because the computation uses a large fixed database that changes infrequently. The only input to this stage is a small query sequence, easily transmitted from AWE.
Several other systems provide similar capabilities, including Berkeley Open Infrastructure for Network Computing and Condor. In our case, we opted to implement AWE in order to model work units in a format specifically suitable for genome sequence data. For example, because of the fixed metagenome database used by MG-RAST, request deduplication can be implemented if AWE can recognize previously executed work units. Because of the relative simplicity of this software, we opted to build a purpose-built orchestration system, as opposed to adapting our workload to an off-the-shelf system.
The Argonne Workflow Engine consists of a centralized set of Python daemons that communicate with clients via a RESTful interface. This approach is widely portable, as clients need only be able to perform Hypertext Transfer Protocol (HTTP) requests to the server. The work queue, as well as results and statistics, is stored in a Postgres database. We have used Facebook's Tornado framework to build a lightweight and efficient set of front tier web servers.
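One way request deduplication against a fixed database could work is sketched below in Python. This is an illustrative assumption, not a description of how AWE does it: the key derivation, the `database_version` label, and the in-memory `ResultCache` are all hypothetical.

```python
import hashlib


def work_unit_key(sequence, database_version):
    """Derive a stable key from the query sequence and the database version.

    Whitespace and letter case are normalized so that trivially different
    copies of the same sequence map to the same key.
    """
    normalized = "".join(sequence.split()).upper()
    digest = hashlib.sha256()
    digest.update(database_version.encode("utf-8"))
    digest.update(b"\0")  # separator between the two key components
    digest.update(normalized.encode("utf-8"))
    return digest.hexdigest()


class ResultCache:
    """In-memory stand-in for a store of previously computed results."""

    def __init__(self):
        self._results = {}

    def get_or_compute(self, sequence, database_version, compute):
        """Return a stored result, calling `compute(sequence)` only on a miss."""
        key = work_unit_key(sequence, database_version)
        if key not in self._results:
            self._results[key] = compute(sequence)
        return self._results[key]
```

Including the database version in the key means that stored results are not reused after the reference database changes.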
Work units are stored in small chunks. When a client requests work, it includes a work unit size; this allows the client to size a request to the resources available. At this point, the small chunks of work are aggregated into a larger unit, and each of these chunks is marked with a lease, including start time and duration. The client, as long as it is running, can extend this lease. If work units reach the end of the lease, they are freed for allocation to other client requests. This mechanism provides basic reliability in the face of unreliable remote clients. Each client can run one or more of these work unit cycles, depending on the resources available. On a current commodity system, these work units typically take 10–15 min to complete.
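The lease mechanism described above can be illustrated with a toy in-memory queue. This Python sketch is not the AWE implementation (which stores its queue in Postgres and serves it over HTTP); the class and method names are hypothetical, and for brevity it does not track which client holds a lease.

```python
import time


class LeaseQueue:
    """Toy in-memory work queue with time-limited leases."""

    def __init__(self, chunks, clock=time.monotonic):
        self._clock = clock
        self._pending = dict(enumerate(chunks))  # chunk id -> payload
        self._leases = {}                        # chunk id -> lease expiry time

    def _expire(self):
        """Free chunks whose lease has run out so that they can be reissued."""
        now = self._clock()
        for chunk_id in [c for c, expiry in self._leases.items() if expiry <= now]:
            del self._leases[chunk_id]

    def request(self, size, duration):
        """Lease up to `size` unleased chunks for `duration` seconds."""
        self._expire()
        now = self._clock()
        free = [c for c in self._pending if c not in self._leases][:size]
        for chunk_id in free:
            self._leases[chunk_id] = now + duration
        return {c: self._pending[c] for c in free}

    def extend(self, chunk_ids, duration):
        """Extend leases that are still active; return the ids actually extended."""
        self._expire()
        now = self._clock()
        extended = [c for c in chunk_ids if c in self._leases]
        for chunk_id in extended:
            self._leases[chunk_id] = now + duration
        return extended

    def complete(self, chunk_ids):
        """Remove finished chunks from the queue."""
        for chunk_id in chunk_ids:
            self._pending.pop(chunk_id, None)
            self._leases.pop(chunk_id, None)
```

A client that disappears simply stops extending its lease, and its chunks are handed to the next requester once the lease expires.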
Although the initial implementation was relatively straightforward, we found that we needed to address a wider range of error conditions than we had previously encountered. SGE had handled many of these issues for us. This highlights one of the important lessons we learned in this process: most pipelines are still user-level software and not system software. By implementing AWE and moving to resources we did not control, we began dealing with a wider range of error conditions. Moreover, these errors happened in a distributed environment, making debugging considerably more difficult. Similarly, debugging distributed performance problems required more sophisticated information gathering than when MG-RAST ran on a single cluster.
After becoming accustomed to dealing with performance issues, we were able to retune our database, modify our work management reliability scheme, and scale a single backend up to 500 compute nodes without issue. Further optimizations can be applied when they are needed; the architecture is designed to scale among multiple backend machines. However, since our current approach has been sufficiently scalable for the compute resources we have access to, we have not pursued this issue.
2.4 The issue of confidentiality or data security in distributed platforms
Using a diverse set of heterogeneous computers might present security issues if access to an arbitrary subset of the data being analyzed is unacceptable to the end user. In our use case, access to subsets of DNA sequences without contextual metadata (see, e.g., GSC-maintained metadata types [7, 9]) will render the data useless. The risk of accidentally granting access to anonymous DNA sequence data without any provenance information (as distributed by the mechanisms we implemented) is deemed acceptable for the application we describe.
However, to further increase anonymity, any existing sequence identifiers generated by sequencing equipment could also be rewritten. In our application, we strip out data protected by the Health Insurance Portability and Accountability Act (HIPAA) before submitting sequence data to distributed resources.
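Rewriting instrument-generated identifiers could look like the following Python sketch. It assumes FASTA input and is illustrative only; the function name, the `seq` prefix, and the identifier format are hypothetical, and the mapping back to the original identifiers is meant to stay on the local system.

```python
def anonymize_headers(lines, prefix="seq"):
    """Replace every FASTA header with a sequential, content-free identifier.

    Returns the rewritten lines and a mapping from each new identifier to the
    original header text, so results can be re-associated locally.
    """
    rewritten, mapping = [], {}
    for line in lines:
        if line.startswith(">"):
            new_id = f"{prefix}{len(mapping) + 1:08d}"
            mapping[new_id] = line[1:].strip()
            rewritten.append(f">{new_id}")
        else:
            rewritten.append(line.rstrip("\n"))
    return rewritten, mapping
```

Only the rewritten lines would be sent to remote resources; results returned under the new identifiers can be translated back using the mapping.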
2.5 A brief discussion of performance
Moving from an integrated local cluster environment to a more distributed type of workload has allowed us to utilize up to 500 nodes, whereas in the past we were restricted to far fewer locally available resources. We have been able to use available cycles from a number of machines, including the Department of Energy Magellan cloud test bed located at Argonne National Laboratory, Argonne's Laboratory Computing Resource Center computer, and Amazon's commercial EC2 offering.
One surprising result is that, because the data volumes transferred after the initial deployment are small, even a DSL-type connection provides suitable throughput for our application model. Interestingly, the results of sequence comparison tend to be significantly larger than the input data sets; they are frequently 10 times the size of the input sets. Here, using our domain knowledge, we have chosen to change the report format for BLAST to produce results in a format with a significantly reduced on-disk and data transfer footprint.
In adapting MG-RAST to cloud and ad hoc computational resources, we have developed a number of guidelines for determining how best to utilize such resources.
3.1 Characterize applications
Applications have a wide range of properties. Serial codes are well suited to clouds, whereas parallel codes are better suited to local clusters. Some applications are highly I/O intensive, limiting their performance on cloud resources, whereas others are not. Each of these characteristics determines how effectively remote resources can be used and whether their use for the task is worthwhile. Similarly, cloud resources have prices associated with access; this information can be used to develop a cost model for analysis that combines the price of a resource with the throughput achieved for a given application on that resource.
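A cost model of this kind can be as simple as dividing price by throughput. The Python sketch below is a minimal illustration under stated assumptions: the function names are hypothetical, and the resource names, prices, and the second throughput value are made up. Only the throughput of 1 gigabase per 2000 core hours comes from the benchmark reported earlier.

```python
def cost_per_gigabase(price_per_core_hour, gigabases_per_core_hour):
    """Cost of analyzing one gigabase on a resource (price divided by throughput)."""
    if gigabases_per_core_hour <= 0:
        raise ValueError("throughput must be positive")
    return price_per_core_hour / gigabases_per_core_hour


def cheapest_resource(resources):
    """Return the name of the resource with the lowest cost per gigabase.

    `resources` maps a name to (price_per_core_hour, gigabases_per_core_hour).
    """
    return min(resources, key=lambda name: cost_per_gigabase(*resources[name]))


if __name__ == "__main__":
    # Hypothetical prices in arbitrary currency units; not real quotes.
    resources = {
        "resource_a": (0.10, 1 / 2000),  # throughput from the v2 benchmark
        "resource_b": (0.08, 1 / 3000),  # cheaper per hour, but slower (made up)
    }
    for name, (price, throughput) in resources.items():
        print(name, round(cost_per_gigabase(price, throughput), 2))
    print("cheapest:", cheapest_resource(resources))
```

The example shows why the hourly price alone is not enough: a resource that is cheaper per core hour can still be more expensive per gigabase if the application runs more slowly on it.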
3.2 Consider portability
Without appropriate care, applications can embed assumptions about the software configuration of computational resources. These assumptions limit the portability of applications, restricting the ability to move computations from one facility to another. When resources become available, it is often advantageous to be able to make use of them quickly.
3.3 Prepare for scalability
The workload for a local computational resource is often very regular; a moderate number of resources will be available overall. In composite environments, including both local and remote resources, the aggregate availability of resources can quickly change. These spikes can cause scalability problems. Scalability issues should be considered during the design and tested with synthetic workloads prior to deployment, if possible.
3.4 Develop effective logging approaches
Many pipeline developers are familiar with application development in a localized setting. Cloud software is system software running in a distributed context. This means that telemetry, logging, and other debugging mechanisms are critical; information collected must be comprehensive to allow post hoc debugging of system issues. The collection of this data can pose scalability problems, so an organized analysis of logging as well as the determination of the log messages required to debug common issues should both be performed.
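One common way to make logs from many remote workers usable for post hoc debugging is to emit structured, machine-readable records. The following Python sketch uses only the standard `logging` module; the JSON field names and the `work_unit` context field are hypothetical choices, not something prescribed by MG-RAST or AWE.

```python
import json
import logging
import socket
import time


class JsonFormatter(logging.Formatter):
    """Format each record as one JSON object per line for later aggregation."""

    def format(self, record):
        entry = {
            "time": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "host": socket.gethostname(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Hypothetical context field, attached by the caller via `extra=`.
            "work_unit": getattr(record, "work_unit", None),
        }
        return json.dumps(entry)


def make_logger(name, stream=None):
    """Create a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger


if __name__ == "__main__":
    log = make_logger("pipeline.similarity")
    log.info("work unit finished", extra={"work_unit": "example-unit"})
```

Recording the host name and work unit with every message makes it possible to correlate events across machines after the fact, and filtering by level keeps the log volume manageable.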
The explosion of sequencing capacity in the bioinformatics community has caused a large and ever-growing need for computational cycles. The bioinformatics community as a whole, and pipeline operators in particular, need to be prepared to make use of distributed resources in order to satisfy the demand for data analysis.
In this paper, we presented our experiences in adapting MG-RAST to make use of distributed resources for its computation. This process was not a simple one; major alterations to our runtime infrastructure, operational models, and debugging processes were required. Although the cost of these changes was high, we feel that these adaptations were required in order to scale to current and future data set sizes and analysis demands.
On-demand architectures for computation have enabled us to perform analysis for larger data sets in larger quantities; however, this alone is not a panacea. Clouds have enabled us to determine the market costs of analysis, but these quantified costs have shown us that we cannot afford to continue performing our analysis with the current generation of algorithms, even taking Moore's Law into account. More efficient analysis algorithms will be needed soon.
1 Committee on Metagenomics: Challenges and Functional Applications. The New Science of Metagenomics: Revealing the Secrets of Our Microbial Planet. The National Research Council, 2007.
2 Tyson GW, Chapman J, et al. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature 2004; 428(6978): 37–43.
3 Venter JC, Remington K, et al. Environmental genome shotgun sequencing of the Sargasso Sea. Science 2004; 304(5667): 66–74.